[ 
https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609703#comment-14609703
 ] 

yuhao yang commented on SPARK-8703:
-----------------------------------

Thanks Joseph. 

It's true that CountVectorizer and HashingTF share similar input and output, 
yet currently CountVectorizer does not actually inherit anything useful from 
HashingTF. I also like the current clean separation among the feature 
transformers, so I'm inclined to undo the extension.

As for code reuse, HashingTF simply invokes the version in mllib, and that 
implementation is quite straightforward, so refactoring for code reuse may 
not be necessary.
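
As a rough illustration of why the two transformers share an interface but 
little logic, here is a minimal sketch in plain Python. It is not the Spark 
implementation: the function names count_vectorize and hashing_tf, the dict 
representation of a sparse vector, and the use of crc32 as the hash are all 
assumptions made for this example.

```python
from collections import Counter
import zlib


def count_vectorize(tokens, vocabulary):
    """Map tokens to a sparse {index: count} dict using a fixed vocabulary.

    Hypothetical helper for illustration. Tokens that are not in the
    vocabulary are dropped.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))


def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {index: count} dict by hashing each token.

    Hypothetical helper for illustration. crc32 is only a stable stand-in
    hash for this sketch and is not meant to match Spark's hash function.
    """
    counts = Counter(
        zlib.crc32(t.encode("utf-8")) % num_features for t in tokens
    )
    return dict(sorted(counts.items()))


doc = ["a", "b", "a", "c", "a"]
print(count_vectorize(doc, ["a", "b", "c"]))  # {0: 3, 1: 1, 2: 1}
print(hashing_tf(doc, 16))  # bucket indices depend on the hash
```

Both functions take a token sequence and return sparse counts, but the 
index lookup is the only real work in each, and it differs: one needs a 
vocabulary, the other only a bucket count. That leaves little to share 
beyond the input and output types.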

[~viirya] and [~fliang], thanks for your opinions. I'd like to know your 
thoughts on this.

> Add CountVectorizer as a ml transformer to convert document to words count 
> vector
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-8703
>                 URL: https://issues.apache.org/jira/browse/SPARK-8703
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: yuhao yang
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Converts a text document to a sparse vector of token counts. Similar to 
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
> I can further add an estimator to extract vocabulary from corpus if that's 
> appropriate.



