[ https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736695#comment-14736695 ]
yuhao yang commented on SPARK-9578:
-----------------------------------

Lemmatization seems to be a better choice for LDA, but it requires POS tags and an extra vocabulary. If there is no other ongoing effort on this, I'd like to start with a simpler Porter implementation and then try to enhance it to Snowball. [~josephkb] The plan is to cover the most general cases with shorter code. After all, MLlib is not specific to NLP.

> Stemmer feature transformer
> ---------------------------
>
>                 Key: SPARK-9578
>                 URL: https://issues.apache.org/jira/browse/SPARK-9578
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from
> [~aloknsingh]. Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
> which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative than Lucene EnglishAnalyzer or
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think
> adding a scalanlp%chalk dependency is also an option. But as you said, we
> can copy the code as it is small.
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)