[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652787#comment-14652787 ]

Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

The stopwords transformer made it into 1.5, but the stemmer will have to wait for 1.6. I just linked them.

LDA should handle text as well
------------------------------

Key: SPARK-5571
URL: https://issues.apache.org/jira/browse/SPARK-5571
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].

There should be:
* a runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary.
* a dictionary parameter for when LDA is run with word count vectors
* prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is currently commented out in LDA)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633131#comment-14633131 ]

Alok Singh commented on SPARK-5571:
-----------------------------------

OK, that sounds good.

Stemmer: There is a Scala stemmer in scalanlp%chalk (https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze) which can easily be copied (as it is an Apache project) and is in Scala too. I think this will be a better alternative than the Lucene EnglishAnalyzer or OpenNLP. Note: we already use scalanlp%breeze via the Maven dependency, so I think adding a scalanlp%chalk dependency is also an option. But, as you said, we can copy the code since it is small.

LDA.runText: Sounds good. About the design doc, I think the steps would be: tokenize, stopword, stem, text2count, LDA.run(), return describeTopicAsString (which will be a concatenation of the top stemmed words in the topic).

Pipeline: I agree that the Pipeline API can be added later and that users can always use LDA.runText from mllib, so we can just add a few more dependent JIRAs. Since the 1.5 release is still some time away, we can have this feature implemented without the pipeline for the 1.5 release.
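For illustration, the preprocessing steps listed in the comment (tokenize, stopword, stem, text2count) could be chained roughly as follows. This is a minimal plain-Python sketch, not Spark code: every function name is hypothetical, the stopword list is a tiny illustrative sample, and `naive_stem` is only a placeholder suffix stripper standing in for a real stemmer such as the one in chalk.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real transformer would ship a fuller one.
STOPWORDS = {"the", "a", "of", "and", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def naive_stem(token):
    """Placeholder stemmer: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """tokenize -> stopword removal -> stem, in that order."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


def text_to_counts(text):
    """text2count: map raw text to a term -> count bag of words."""
    return Counter(preprocess(text))


def describe_topic_as_string(top_terms):
    """Concatenate a topic's top stemmed terms into one string."""
    return " ".join(top_terms)
```

The resulting counts would still need to be indexed into vectors before being handed to LDA.run(); that step is left out here.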
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630871#comment-14630871 ]

Alok Singh commented on SPARK-5571:
-----------------------------------

Hi Feynman,

Sorry for the delay and the gap; here at work we had some training and a few internal updates/changes, and I was not able to respond. Here are my thoughts; please comment.

Stemmer: I think we will need a stemmer module too. I was thinking we could just create a wrapper over the Lucene EnglishAnalyzer or the OpenNLP stemmer. This can be a separate transformer JIRA under the 'ml' tag. Without this component, we will have a lot of edges and nodes in the created GraphX graph.

Stopword: We can support two ways:
- in one, the user gives the list of stop words
- in the other, we calculate it using the IDF with the TF-IDF transformer. We could create a new transformer which, under the hood, calls the TF-IDF transformer with a filter range.
This can also be another transformer JIRA under the 'ml' tag.

LDA.runText: The core LDA.runText method can be under the mllib tag, and it becomes easier with the assumption that the input bag of words just needs to be passed to a CountVectorizer and then to LDA.run, which will be implemented as per the description.

The complete pipeline: Users can create their own pipeline using ml, but I think we should create a TextLDA_Pipeline which combines the above steps, and put it under an 'ml' tag JIRA.

What are your thoughts, [~josephkb] and [~fliang]?

Alok
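As a rough illustration of the second stopword option (deriving stop words from IDF rather than from a user-supplied list), here is a minimal plain-Python sketch. It is not Spark code: the function names, the `max_idf` threshold parameter, and the choice of the smoothed formula log((n + 1) / (df + 1)) are assumptions made for this example.

```python
import math


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return df


def idf_stopwords(docs, max_idf):
    """Return the terms whose smoothed IDF is at or below max_idf.

    Terms that occur in almost every document get an IDF close to zero,
    so a low threshold selects them as candidate stop words.
    """
    n = len(docs)
    df = document_frequencies(docs)
    return {
        term
        for term, count in df.items()
        if math.log((n + 1) / (count + 1)) <= max_idf
    }


# Tiny example corpus of already-tokenized documents.
corpus = [["the", "spark", "lda"], ["the", "text"], ["the", "spark"]]
candidates = idf_stopwords(corpus, max_idf=0.1)
```

With this corpus only "the" appears in every document, so it is the only term selected at the 0.1 threshold; raising the threshold would start to pull in ordinary content words as well, which is why the filter range would need to be a tunable parameter.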
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632052#comment-14632052 ]

Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

Stemmer: We'll need to be careful about adding dependencies on other libraries. We strongly prefer avoiding that if possible. If code can be copied and modified (assuming the license is friendly to copying), that might be preferable if the code is relatively simple.

Stopwords: Sounds good.

LDA.runText: I'd prefer this handle everything automatically: a user gives an unfiltered corpus and LDA handles it. This probably requires a quick design doc, since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API.

Here's what I propose:
* For now, we focus on adding the necessary transformers individually: stemmer, stopwords filter.
* For the next release, we design a good way to provide this functionality under Pipelines.

If that sounds good, we can create linked JIRAs for those transformers, and I'll move the target version for this JIRA to 1.6. What do you think?
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624152#comment-14624152 ]

Feynman Liang commented on SPARK-5571:
--------------------------------------

[~a...@jivesoftware.com], are you still working on this? I wanted to point out that [CountVectorizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala] was recently merged and seems appropriate for this task. If you aren't working on this anymore, I would be happy to take this task.
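To make the count-vectorization step concrete, here is a minimal plain-Python sketch of indexing terms into a dictionary, turning bags of words into word-count vectors, and mapping a topic's term indices back to strings. It is not Spark code and does not reproduce the CountVectorizer API; all function names and the ordering rule for the dictionary are assumptions made for this example.

```python
from collections import Counter


def build_dictionary(docs):
    """Map each distinct term to an integer index, most frequent terms first."""
    totals = Counter(term for doc in docs for term in doc)
    # Sort by descending corpus frequency, then alphabetically for determinism.
    ordered = sorted(totals, key=lambda t: (-totals[t], t))
    return {term: i for i, term in enumerate(ordered)}


def to_count_vector(doc, dictionary):
    """Turn one bag of words into a dense word-count vector."""
    vec = [0] * len(dictionary)
    for term in doc:
        idx = dictionary.get(term)
        if idx is not None:  # ignore terms absent from the dictionary
            vec[idx] += 1
    return vec


def describe_topic_as_strings(term_weights, dictionary, top_n=3):
    """Translate a topic's per-index weights into (term, weight) pairs."""
    index_to_term = {i: t for t, i in dictionary.items()}
    ranked = sorted(range(len(term_weights)), key=lambda i: -term_weights[i])
    return [(index_to_term[i], term_weights[i]) for i in ranked[:top_n]]


# Two already-tokenized documents (bags of words).
docs = [["spark", "lda", "topic"], ["spark", "text", "spark"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(d, dictionary) for d in docs]
```

Keeping the dictionary alongside the vectors is what allows topic descriptions to be reported as strings rather than as raw term indices.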
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606159#comment-14606159 ]

Alok Singh commented on SPARK-5571:
-----------------------------------

Since there is already a Tokenizer class, we can assume the other classes will be made as well, so I can assume that the input is already tokenized, stemmed, and stopword-filtered.
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606161#comment-14606161 ]

Alok Singh commented on SPARK-5571:
-----------------------------------

I would like to work on it, if everyone is OK with that.
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605200#comment-14605200 ]

Alok Singh commented on SPARK-5571:
-----------------------------------

Just wanted to get more clarification on this. Does this JIRA expect all the components, i.e. tokenizer -> stemmer -> stopword -> runWithPrunedBagOfWords? Or do we assume that the input is already tokenized, stemmed, and stopword-filtered?
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606869#comment-14606869 ]

Joseph K. Bradley commented on SPARK-5571:
------------------------------------------

Thanks for your interest! The API should take text (before any preprocessing) and then handle preprocessing internally. Internally, we should definitely take advantage of existing transformers to handle tokenization, stemming, stopword removal, etc.