[ https://issues.apache.org/jira/browse/SPARK-8703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609703#comment-14609703 ]
yuhao yang commented on SPARK-8703:
-----------------------------------

Thanks Joseph. It's true that CountVectorizer and HashingTF share similar input and output, yet currently CountVectorizer does not actually inherit anything useful from HashingTF. I also like the current clean separation among the feature transformers, so I'm inclined to undo the extension.

As for code reuse: HashingTF invokes the version in mllib, and the implementation is quite straightforward, so refactoring for code reuse may not be necessary.

[~viirya] and [~fliang], thanks for your opinions. I'd like to know your thoughts on this.

> Add CountVectorizer as a ml transformer to convert document to words count vector
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-8703
>                 URL: https://issues.apache.org/jira/browse/SPARK-8703
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: yuhao yang
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Converts a text document to a sparse vector of token counts. Similar to
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
> I can further add an estimator to extract vocabulary from corpus if that's
> appropriate.
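
For readers following the thread, here is a rough sketch of why the two transformers can share input and output types while sharing little logic. It is plain Python, not Spark code. The function names `count_vectorize` and `hashing_tf` are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the real HashingTF uses. Both return a sparse `{index: count}` map, but one looks indices up in an explicit vocabulary and the other derives them from a hash.

```python
from collections import Counter
import zlib


def count_vectorize(tokens, vocabulary):
    """Sparse {index: count} map; indices come from an explicit vocabulary.

    Tokens that are not in the vocabulary are dropped (an assumption of
    this sketch, not a statement about the Spark implementation).
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))


def hashing_tf(tokens, num_features):
    """Sparse {index: count} map; indices come from hash(term) mod num_features.

    crc32 is a deterministic stand-in hash chosen for illustration only.
    Different terms may collide on the same index.
    """
    counts = Counter(
        zlib.crc32(t.encode("utf-8")) % num_features for t in tokens
    )
    return dict(sorted(counts.items()))


doc = ["a", "b", "a", "c", "a"]
vocab = ["a", "b", "c"]

print(count_vectorize(doc, vocab))  # indices follow the vocabulary order
print(hashing_tf(doc, 16))          # indices depend on the hash function
```

The vocabulary-based version needs a term list up front (or an estimator to learn one from the corpus, as the issue description suggests), while the hashing version needs only a feature dimension. That difference is consistent with the point above that there is little for one to inherit from the other.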