[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606869#comment-14606869
 ] 

Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

Thanks for your interest!  The API should take raw text (before any 
preprocessing) and handle the preprocessing itself.  Internally, we should 
definitely reuse existing transformers to handle tokenization, stemming, 
stopword removal, etc.
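
To make the preprocessing steps concrete, here is a minimal plain-Python sketch of how raw text could be turned into the word-count vectors LDA consumes today. It is not Spark code and not the proposed API: the function names, the tiny stopword list, and the regex tokenizer are all assumptions made for illustration, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "on", "is", "to", "in"})

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def build_dictionary(tokenized_docs):
    """Map each distinct term to an integer index, in sorted term order."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    return {term: i for i, term in enumerate(vocab)}

def to_count_vector(tokens, dictionary):
    """Sparse word-count vector as {term index: count}; unknown terms are skipped."""
    return dict(Counter(dictionary[t] for t in tokens if t in dictionary))

def preprocess(raw_docs):
    """Run tokenization and stopword removal, then index terms and count them."""
    tokenized = [remove_stopwords(tokenize(d)) for d in raw_docs]
    dictionary = build_dictionary(tokenized)
    vectors = [to_count_vector(doc, dictionary) for doc in tokenized]
    return dictionary, vectors
```

Calling `preprocess(["The cat sat on the mat", "The dog and the cat"])` returns both the term dictionary and one sparse count vector per document, so a caller never has to build the vocabulary by hand.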


> LDA should handle text as well
> ------------------------------
>
>                 Key: SPARK-5571
>                 URL: https://issues.apache.org/jira/browse/SPARK-5571
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
> counts.  It should also support training and prediction using text 
> (Strings).
> This plan is sketched in the [original LDA design 
> doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
> There should be:
> * runWithText() method which takes an RDD with a collection of Strings (bags 
> of words).  This will also index terms and compute a dictionary.
> * dictionary parameter for when LDA is run with word count vectors
> * prediction/feedback methods returning Strings (such as 
> describeTopicsAsStrings, which is commented out in LDA currently)
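
As a rough illustration of the last two bullets, the following plain-Python sketch shows how a dictionary could turn index-based topic descriptions into string-based ones. The input shape (one pair of term indices and term weights per topic) and the list-based dictionary are assumptions for this example; only the name describeTopicsAsStrings comes from the issue text, and this is not the commented-out Spark implementation.

```python
def describe_topics_as_strings(topics, dictionary):
    """Replace term indices with term strings in each topic description.

    topics: list of (term_indices, term_weights) pairs, one per topic
            (assumed shape, for illustration).
    dictionary: list where position i holds the term string for index i.
    """
    described = []
    for term_indices, term_weights in topics:
        # Pair each term string with its weight, preserving the input order.
        described.append(
            [(dictionary[i], w) for i, w in zip(term_indices, term_weights)]
        )
    return described
```

With a dictionary such as `["cat", "dog", "mat"]`, a topic given as `([2, 0], [0.6, 0.4])` would be reported as `[("mat", 0.6), ("cat", 0.4)]`.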



