[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-08-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652787#comment-14652787
 ] 

Joseph K. Bradley commented on SPARK-5571:
--

The stopwords transformer made it for 1.5, but the stemmer will need to be in 
1.6.  Just linked them.

 LDA should handle text as well
 --

 Key: SPARK-5571
 URL: https://issues.apache.org/jira/browse/SPARK-5571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
 counts.  It should also support training and prediction using text 
 (Strings).
 This plan is sketched in the [original LDA design 
 doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
 There should be:
 * runWithText() method which takes an RDD with a collection of Strings (bags 
 of words).  This will also index terms and compute a dictionary.
 * dictionary parameter for when LDA is run with word count vectors
 * prediction/feedback methods returning Strings (such as 
 describeTopicsAsStrings, which is commented out in LDA currently)
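
As a rough illustration of the three items above, here is a plain-Python sketch (not Spark code) of indexing terms into a dictionary, turning bags of words into word-count vectors, and mapping topic term ids back to Strings. All function names and the (term_id, weight) topic layout are assumptions made for this example only.

```python
from collections import Counter


def build_dictionary(docs):
    """Index terms: map each distinct term to an integer id in first-seen order."""
    term_to_index = {}
    for doc in docs:
        for term in doc:
            if term not in term_to_index:
                term_to_index[term] = len(term_to_index)
    return term_to_index


def to_count_vector(doc, term_to_index):
    """Turn one bag of words into a dense word-count vector."""
    vector = [0] * len(term_to_index)
    for term, n in Counter(doc).items():
        vector[term_to_index[term]] = n
    return vector


def describe_topics_as_strings(topics, term_to_index):
    """Replace term ids in (term_id, weight) pairs with the terms themselves.

    The (term_id, weight) layout is a hypothetical stand-in for whatever
    the real topic description would return.
    """
    index_to_term = {i: t for t, i in term_to_index.items()}
    return [[(index_to_term[i], w) for i, w in topic] for topic in topics]


# Two tiny bags of words standing in for an RDD of documents.
docs = [["spark", "lda", "topic"], ["spark", "text", "text"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(d, dictionary) for d in docs]
```

The count vectors are what LDA consumes today; the dictionary is what a String-returning method would need in order to print terms instead of indices.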



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-20 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633131#comment-14633131
 ] 

Alok Singh commented on SPARK-5571:
---

OK, that sounds good.

Stemmer: There is a Scala stemmer in scalanlp%chalk 
https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
which can easily be copied (it is Apache-licensed) and is already in Scala.
I think this will be a better alternative than the Lucene EnglishAnalyzer or 
OpenNLP.
Note: we already use scalanlp%breeze via a Maven dependency, so I think 
adding a scalanlp%chalk dependency is also an option. But as you said, we 
can copy the code since it is small. 


LDA.runText: Sounds good. About the design doc, I think the steps would be 
tokenize, remove stopwords, stem, text2count, LDA.run(), then return 
describeTopicAsString (a concatenation of the top stemmed words in the topic). 
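
The preprocessing part of that chain (tokenize, remove stopwords, stem, text2count) could look roughly like the plain-Python sketch below. This is not Spark code: the stopword list is illustrative, the function names are made up for the example, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer such as the one in chalk.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real list would be much longer.
STOPWORDS = {"the", "a", "of", "and", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def naive_stem(token):
    """Placeholder stemmer: strip a few common suffixes.

    This is NOT the Porter algorithm; it only keeps the example self-contained.
    """
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_counts(text):
    """tokenize -> remove stopwords -> stem -> count (the text2count step)."""
    tokens = remove_stopwords(tokenize(text))
    return Counter(naive_stem(t) for t in tokens)


counts = text_to_counts("The topics of the topic model are training")
```

The resulting counts are what would be handed to LDA.run() once they are mapped onto a shared dictionary of term indices.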

Pipeline: I agree that the Pipeline API can be added later; users can 
always call LDA.runText from mllib. So we can just add a few more 
dependent JIRAs. Since the 1.5 release is still some time away, we can have 
this feature implemented without the pipeline for the 1.5 release. 






[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-17 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630871#comment-14630871
 ] 

Alok Singh commented on SPARK-5571:
---

Hi Feynman,

Sorry for the delay. We had some training and a few internal changes at 
work, and I was not able to respond.


Here are my thoughts; please comment.

stemmer

I think we will need the stemmer module too. I was thinking we can just create 
a wrapper over the Lucene EnglishAnalyzer or the OpenNLP stemmer. This can be 
a separate transformer JIRA under the 'ml' tag.
Without this component, we will have a lot of edges and nodes in the created 
GraphX graph.

Stopword

We can support two ways:
- In one, the user gives the list of stop words.
- In the other, we calculate it using the IDF from the TF-IDF transformer. We 
could create a new transformer which under the hood calls the TF-IDF 
transformer with a filter range. This can also be another transformer JIRA 
under the 'ml' tag.
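
To make the two options concrete, here is a small plain-Python sketch (not Spark code). It assumes a smoothed IDF of log((N + 1) / (df + 1)) and treats every term whose IDF falls at or below a cutoff as a stopword; the function names and the `max_idf` parameter are assumptions for this example, standing in for the "filter range" mentioned above.

```python
import math


def idf_scores(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}


def stopwords_by_idf(docs, max_idf):
    """Terms that occur in so many documents that their IDF is <= max_idf."""
    return {t for t, idf in idf_scores(docs).items() if idf <= max_idf}


def filter_stopwords(docs, user_stopwords=None, max_idf=None):
    """Option 1: a user-supplied list. Option 2: an IDF cutoff. Both may be combined."""
    stop = set(user_stopwords or ())
    if max_idf is not None:
        stop |= stopwords_by_idf(docs, max_idf)
    return [[t for t in doc if t not in stop] for doc in docs]


docs = [
    ["the", "spark", "job"],
    ["the", "lda", "model"],
    ["the", "topic", "model"],
]
filtered = filter_stopwords(docs, user_stopwords=["job"], max_idf=0.1)
```

Here "the" occurs in every document, so its IDF is 0 and it is dropped by the cutoff, while "job" is dropped only because the user listed it.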

The LDA.runText
--
The core LDA.runText method can be under the mllib tag and becomes simpler 
under the assumption that the input bag of words just needs to be passed to 
a CountVectorizer and then to LDA.run, which will be implemented as per the 
description.

The complete pipeline
-
Users can create their own pipeline using ml, but I think we should create a 
TextLDA_Pipeline which combines the above steps, and put it under an 
'ml' tag JIRA.


What are your thoughts, [~josephkb] and [~fliang]?

Alok





[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632052#comment-14632052
 ] 

Joseph K. Bradley commented on SPARK-5571:
--

Stemmer: We'll need to be careful about adding dependencies on other libraries. 
 We strongly prefer avoiding that if possible.  If code can be copied and 
modified (assuming the license is friendly to copying), that might be 
preferable if the code is relatively simple.

Stopwords: Sounds good.

LDA.runText: I'd prefer this handle everything automatically: A user gives an 
unfiltered corpus and LDA handles it.  This actually probably requires a quick 
design doc since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API.  Here's what I 
propose:
* For now, we focus on adding the necessary transformers individually: stemmer, 
stopwords filter.
* For the next release, we design a good way to provide this functionality 
under Pipelines.

If that sounds good, we can create and link JIRAs for those transformers, and 
I'll move the target version for this JIRA to 1.6.  What do you think?




[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-12 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624152#comment-14624152
 ] 

Feynman Liang commented on SPARK-5571:
--

[~a...@jivesoftware.com], are you still working on this? I wanted to point out 
[CountVectorizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala]
 was recently merged and seems appropriate for this task.

If you aren't working on this anymore, I would be happy to take this task.




[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606159#comment-14606159
 ] 

Alok Singh commented on SPARK-5571:
---

Since there is already a Tokenizer class, we can assume the other classes will 
be added. So I can assume that the input is already tokenized, stemmed, and 
has stopwords removed.




[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161
 ] 

Alok Singh commented on SPARK-5571:
---

I would like to work on it if everyone is OK with that.




[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605200#comment-14605200
 ] 

Alok Singh commented on SPARK-5571:
---

Just wanted to get more clarification on this.
Does this JIRA expect all the components, i.e. tokenizer - stemmer - 
stopword - runWithPrunedBagOfWords? Or do we assume that the input is 
already tokenized, stemmed, and stopword-filtered?






[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606869#comment-14606869
 ] 

Joseph K. Bradley commented on SPARK-5571:
--

Thanks for your interest!  The API should take text (before any preprocessing) 
and then handle preprocessing internally.  Internally, we should definitely 
take advantage of existing transformers to handle tokenization, stemming, 
stopword removal, etc.

