[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-08-03 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652787#comment-14652787
 ] 

Joseph K. Bradley commented on SPARK-5571:
--

The stopwords transformer made it into 1.5, but the stemmer will have to wait
for 1.6.  I just linked both JIRAs.

 LDA should handle text as well
 --

 Key: SPARK-5571
 URL: https://issues.apache.org/jira/browse/SPARK-5571
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 Latent Dirichlet Allocation (LDA) currently operates only on vectors of word 
 counts.  It should also support training and prediction using text
 (Strings).
 This plan is sketched in the [original LDA design 
 doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing].
 There should be:
 * runWithText() method which takes an RDD with a collection of Strings (bags 
 of words).  This will also index terms and compute a dictionary.
 * dictionary parameter for when LDA is run with word count vectors
 * prediction/feedback methods returning Strings (such as 
 describeTopicsAsStrings, which is commented out in LDA currently)
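
The three items above can be sketched in a few lines. This is only an illustration in plain Python, not Spark's API: the function names (build_dictionary, to_count_vector, describe_topics_as_strings) and the sample topic weights are hypothetical, and an in-memory list stands in for the RDD.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index, in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for term in doc:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


def to_count_vector(doc, dictionary):
    """Sparse word-count vector as {term index: count}; unknown terms are skipped."""
    return dict(Counter(dictionary[t] for t in doc if t in dictionary))


def describe_topics_as_strings(topic_weights, dictionary, max_terms=3):
    """Map each topic's highest-weighted term indices back to their Strings."""
    index_to_term = {i: t for t, i in dictionary.items()}
    described = []
    for weights in topic_weights:
        top = sorted(range(len(weights)), key=lambda i: -weights[i])[:max_terms]
        described.append([(index_to_term[i], weights[i]) for i in top])
    return described


# Bags of words stand in for the RDD of String collections.
docs = [["spark", "lda", "topic"], ["spark", "rdd", "spark"]]
dictionary = build_dictionary(docs)
vectors = [to_count_vector(d, dictionary) for d in docs]

# Made-up weights for a single topic, one weight per dictionary entry.
topics = describe_topics_as_strings([[0.5, 0.1, 0.3, 0.1]], dictionary, max_terms=2)
```

Keeping the dictionary as a separate value is what would let the same term indexing be reused when LDA is run directly on word count vectors.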



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-20 Thread Alok Singh (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633131#comment-14633131
]

Alok Singh commented on SPARK-5571:
---

OK, that sounds good.

Stemmer: There is a Scala stemmer in scalanlp%chalk
https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
which can easily be copied (as it is an Apache project) and is in Scala too.
I think this will be a better alternative than the Lucene EnglishAnalyzer or
OpenNLP. Note: we already use scalanlp%breeze via the Maven dependency, so I
think adding a scalanlp%chalk dependency is also an option. But as you said, we
can copy the code, as it is small.

LDA.runText: Sounds good. As for the design doc, I think the steps would be
tokenize, remove stopwords, stem, text2count, LDA.run(), and return
describeTopicAsString (which will be a concatenation of the top stemmed words
in the topic).
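
A rough sketch of the steps up to text2count, in plain Python rather than Spark's Scala API. Everything here is an assumption made for illustration: the stopword list is a toy one, and stem() is a crude suffix stripper standing in for a real stemmer such as the chalk one mentioned above.

```python
import re
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "is", "and", "on"}


def tokenize(text):
    """Lowercase the text and split on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_counts(text):
    """tokenize -> stopword removal -> stem -> term counts."""
    tokens = remove_stopwords(tokenize(text))
    return Counter(stem(t) for t in tokens)


counts = text_to_counts("The topics of the models: modeling topics on Spark")
```

The resulting term counts are what would then be indexed against a dictionary and passed to LDA.run().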

Pipeline: I agree that the Pipeline API can be added later; users can always
use LDA.runText from mllib in the meantime. So we can just add a few more
dependent JIRAs. Since the 1.5 release is still some time away, we can have
this feature implemented without the pipeline for the 1.5 release.


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-17 Thread Alok Singh (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630871#comment-14630871
]

Alok Singh commented on SPARK-5571:
---

Hi Feynman,

Sorry for the delay and the gap; here at work we had some training and a few
internal updates/changes, and I was not able to respond.

Here are my thoughts; please comment.

Stemmer

I think we will need the stemmer module too. I was thinking we could just
create a wrapper over the Lucene EnglishAnalyzer or the OpenNLP stemmer. This
can be a separate transformer JIRA under the 'ml' tag.
Without this component, we will have a lot of edges and nodes in the resulting
GraphX graph.

Stopword

We can support two ways:
- In one, the user gives the list of stop words.
- In the other, we calculate it using the IDF with the TF-IDF transformer. We
could create a new transformer which under the hood calls the TF-IDF
transformer with a filter range. This can also be another transformer JIRA
under the 'ml' tag.
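
A minimal sketch of the second option, assuming a smoothed IDF and a single upper threshold in place of the filter range. It is plain Python for illustration; the function names and the max_idf parameter are hypothetical and are not part of any existing transformer.

```python
import math


def inverse_document_frequency(docs):
    """Smoothed IDF: log((N + 1) / (df + 1)); terms found in most docs score near 0."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((n + 1) / (count + 1)) for term, count in df.items()}


def infer_stopwords(docs, max_idf):
    """Treat every term whose IDF is at or below max_idf as a stopword."""
    idf = inverse_document_frequency(docs)
    return {term for term, score in idf.items() if score <= max_idf}


docs = [
    ["the", "spark", "lda"],
    ["the", "topic", "model"],
    ["the", "spark", "rdd"],
]
idf = inverse_document_frequency(docs)
stopwords = infer_stopwords(docs, max_idf=0.1)
```

A term that occurs in every document gets an IDF of 0 under this formula, so any small positive threshold filters it out; choosing the threshold for a real corpus is a separate question.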

The LDA.runText
--
The core LDA.runText method can be under the mllib tag and becomes easier with
the assumption that the input bag of words just needs to be passed to a
CountVectorizer and then to LDA.run, which will be implemented as per the
description.

The complete pipeline
-
Users can create their own pipeline using ml, but I think we should create a
TextLDA_Pipeline which combines the above steps, and put it under an 'ml' tag
JIRA.

What are your thoughts, [~josephkb] and [~fliang]?

Alok


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-17 Thread Joseph K. Bradley (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14632052#comment-14632052
]

Joseph K. Bradley commented on SPARK-5571:
--

Stemmer: We'll need to be careful about adding dependencies on other libraries.
We strongly prefer avoiding that if possible. If code can be copied and
modified (assuming the license is friendly to copying), that might be
preferable if the code is relatively simple.

Stopwords: Sounds good.

LDA.runText: I'd prefer this handle everything automatically: A user gives an
unfiltered corpus and LDA handles it. This actually probably requires a quick
design doc since I have not thought through the complexities.

Pipeline: I agree this might work well under the Pipelines API. Here's what I
propose:
* For now, we focus on adding the necessary transformers individually: stemmer,
stopwords filter.
* For the next release, we design a good way to provide this functionality
under Pipelines.

If that sounds good, we can create linked JIRAs for those transformers, and
I'll move the target version for this JIRA to 1.6. What do you think?


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-07-12 Thread Feynman Liang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624152#comment-14624152
 ] 

Feynman Liang commented on SPARK-5571:
--

[~a...@jivesoftware.com], are you still working on this? I wanted to point out 
[CountVectorizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala]
 was recently merged and seems appropriate for this task.

If you aren't working on this anymore, I would be happy to take this task.


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606159#comment-14606159
 ] 

Alok Singh commented on SPARK-5571:
---

Since there is already a Tokenizer class, we can assume the other classes will
be made. So I can assume that the input is already tokenized, stemmed, and
stopword-filtered.


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606161#comment-14606161
 ] 

Alok Singh commented on SPARK-5571:
---

I would like to work on it if everyone is OK with that.


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Alok Singh (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605200#comment-14605200
]

Alok Singh commented on SPARK-5571:
---

Just wanted to get more clarification on this.
Does this JIRA expect all the components, i.e. tokenizer - stemmer -
stopword - runWithPrunedBagOfWords? Or do we assume that the input is
already tokenized, stemmed, and stopword-filtered?


[jira] [Commented] (SPARK-5571) LDA should handle text as well

2015-06-29 Thread Joseph K. Bradley (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606869#comment-14606869
]

Joseph K. Bradley commented on SPARK-5571:
--

Thanks for your interest! The API should take text (before any preprocessing)
and then handle preprocessing internally. Internally, we should definitely
take advantage of existing transformers to handle tokenization, stemming,
stopword removal, etc.
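
One way to picture this: the text entry point owns an ordered list of reusable stages and applies them before training. The sketch below is plain Python under that assumption; run_with_text, the two stages, and the train callback are hypothetical names, not Spark's transformers.

```python
def run_with_text(raw_docs, stages, train):
    """Apply each preprocessing stage to every document in order, then call train."""
    docs = raw_docs
    for stage in stages:
        docs = [stage(doc) for doc in docs]
    return train(docs)


def lowercase_split(text):
    """Stand-in for a tokenizer stage: lowercase and split on whitespace."""
    return text.lower().split()


def drop_short(tokens):
    """Stand-in for a filtering stage: drop tokens of two characters or fewer."""
    return [t for t in tokens if len(t) > 2]


# The identity callback returns the preprocessed documents unchanged,
# taking the place of the actual LDA training call.
processed = run_with_text(
    ["LDA on Spark", "An RDD of text"],
    stages=[lowercase_split, drop_short],
    train=lambda docs: docs,
)
```

The caller only ever passes raw text; which stages run, and in what order, stays an internal detail that can change as new transformers become available.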

