[ https://issues.apache.org/jira/browse/SPARK-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578444#comment-14578444 ]
yuhao yang commented on SPARK-8169:
-----------------------------------

This looks useful. I'd like to give it a try if no one has started on this.

I also think there could be more transformers for text pre-processing, such as the text vectorization in the LDA example and a low-frequency filter.

Some rough ideas:
The default stop words will probably contain English only, yet the StopWordsRemover should support ASCII.
Case sensitivity will be a parameter.

Let me know if I'm missing any requirements.

> Add StopWordsRemover as a transformer
> -------------------------------------
>
>                 Key: SPARK-8169
>                 URL: https://issues.apache.org/jira/browse/SPARK-8169
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 1.5.0
>            Reporter: Xiangrui Meng
>
> StopWordsRemover takes a string array column and outputs a string array
> column with all defined stop words removed. The transformer should also come
> with a standard set of stop words as default.
> {code}
> val stopWords = new StopWordsRemover()
>   .setInputCol("words")
>   .setOutputCol("cleanWords")
>   .setStopWords(Array(...)) // optional
> val output = stopWords.transform(df)
> {code}
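
To make the rough ideas above concrete, here is a minimal sketch of the filtering logic in plain Scala, without any Spark dependency. It is not the Spark API: the object name StopWordsSketch, the function removeStopWords, its parameters, and the tiny default word list are all hypothetical and chosen only for illustration. It assumes that case-insensitive matching is done by lower-casing both the stop words and the input tokens.

```scala
import java.util.Locale

// Hypothetical helper for illustration only; not part of Spark.
object StopWordsSketch {
  // Tiny illustrative default list (assumption); a real default set would be much larger.
  val defaultStopWords: Set[String] = Set("a", "an", "the", "is", "of", "and")

  // Returns `words` with every stop word removed.
  // When caseSensitive is false, both sides are lower-cased before comparing.
  def removeStopWords(
      words: Seq[String],
      stopWords: Set[String] = defaultStopWords,
      caseSensitive: Boolean = false): Seq[String] = {
    if (caseSensitive) {
      words.filterNot(stopWords.contains)
    } else {
      val lowered = stopWords.map(_.toLowerCase(Locale.ROOT))
      words.filterNot(w => lowered.contains(w.toLowerCase(Locale.ROOT)))
    }
  }
}
```

With the defaults, StopWordsSketch.removeStopWords(Seq("The", "quick", "fox")) drops "The"; passing caseSensitive = true keeps it, because only the lower-case "the" is in the list.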