[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148555#comment-14148555
]
Pedro Rodriguez commented on SPARK-1405:
----------------------------------------
[~mengxr], definitely a good idea to be coordinated about it. I have been
working with Evan, so I have been giving him status updates and making todos
with him. I will post progress updates here as well.
I have been working on creating a design doc/reference which you can find here:
https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
It is, in large part, a way for us/me to keep notes while working, but I
would like to take some of it and convert it into documentation. It primarily
contains:
1. Relevant links to papers/code/repositories
2. Thorough explanation/documentation of LDA and motivation behind the graph
implementation (Joey's version)
3. Testing steps (which data sets on what)
4. Current todos (perhaps we should post them here primarily and update the
doc for consistency).
1. Currently I am working on testing (unit tests and correctness testing),
refactoring, and extending Joey's implementation. The objective for this week
is to have the mini-test running (a set of ~10 documents which acts as a
sanity check). The goal for early next week is to be running on NIPS. I think
the majority of the time to get there will be spent putting the dataset into a
parseable format (removing equations, stop words, ...) and ensuring that the
result looks correct.
To that end, we plan on running the same datasets through GraphLab for
benchmarking machine/ML performance, and through a Python implementation for
ML performance/correctness.
Once we are there, the plan is to start looking at running on Wikipedia.
2. The code I am currently working on lives here:
https://github.com/EntilZha/spark
https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/TopicModeling.scala
which is within GraphX, with the other graph based algorithms.
3. Prior to knowing about Joey's graph implementation, I wrote my own for a
final project. I stopped working on it since the graph implementation should
be more performant. It is probably a good point of discussion whether a
"standard" and a graph implementation should live side by side. When you
reference a standard implementation, is there a particular implementation you
are referring to that I can look at?
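For context on what I mean by a "standard" (non-graph) implementation: the usual collapsed Gibbs sampler keeps doc-topic and topic-word count tables and resamples each token's topic from its conditional distribution. A rough single-machine sketch follows; all names are illustrative and not taken from either codebase:

```scala
import scala.util.Random

// Minimal collapsed Gibbs sampler for LDA on a toy corpus.
// docs(d) is the sequence of word ids in document d.
class GibbsLDA(docs: Array[Array[Int]], numTopics: Int, vocabSize: Int,
               alpha: Double = 0.1, beta: Double = 0.01, seed: Long = 42L) {
  private val rng = new Random(seed)
  // z(d)(i): topic currently assigned to the i-th token of document d
  private val z = docs.map(_.map(_ => rng.nextInt(numTopics)))
  // Count tables, maintained incrementally as assignments change.
  private val docTopic   = Array.ofDim[Int](docs.length, numTopics)
  private val topicWord  = Array.ofDim[Int](numTopics, vocabSize)
  private val topicTotal = new Array[Int](numTopics)
  for (d <- docs.indices; i <- docs(d).indices) {
    val k = z(d)(i)
    docTopic(d)(k) += 1; topicWord(k)(docs(d)(i)) += 1; topicTotal(k) += 1
  }

  // One full sweep: resample every token's topic assignment.
  def sweep(): Unit =
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i); val old = z(d)(i)
      // Remove the current assignment from the counts.
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
      // Unnormalized conditional p(z = k | all other assignments).
      val p = Array.tabulate(numTopics) { k =>
        (docTopic(d)(k) + alpha) *
          (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta)
      }
      // Draw from the discrete distribution proportional to p.
      var u = rng.nextDouble() * p.sum; var k = 0
      while (k < numTopics - 1 && u > p(k)) { u -= p(k); k += 1 }
      z(d)(i) = k
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }

  def topicOf(d: Int): Int = docTopic(d).indexOf(docTopic(d).max)
  def totalAssigned: Int = topicTotal.sum
}
```

The graph version expresses the same counts as vertex/edge state on the bipartite document-word graph, which is what should make it scale better on Spark.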
TLDR timeline:
End of this week: mini-dataset for sanity check + refactoring code + unit
testing
Next week: Format NIPS for input + run NIPS data set on Spark, GraphLab, and
Python LDA. I will be away from Berkeley at a conference, but hope to still get
those done.
From there, we would like to get running on larger datasets for performance
testing.
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Labels: features
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts
> topics from a text corpus. Unlike the current machine learning algorithms
> in MLlib, instead of using optimization algorithms such as gradient descent,
> LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare a LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (solved yet), a word segmentation (import from Lucene),
> and a Gibbs sampling core.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]