[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272184#comment-14272184
 ] 

Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

Sounds good, Joseph. I have some good news: I finished initial testing of runtime 
vs. number of topics this morning. You can find the raw numbers below. The top 
set is from the fast LDA sampler from the paper above; the bottom set uses ordinary 
sampling. They produce equivalent results, but the fast version exploits the fact 
that the majority of the mass of the CDF is concentrated in only a few topics, so 
it is smart to check those first.
https://docs.google.com/spreadsheets/d/1RZLsnfLL2XmKWNJ6kPM_KaiDkTfQOYdXdG5EywlY3Os/edit?usp=sharing
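To make the "check the heavy topics first" idea concrete, here is a minimal,
self-contained sketch (not the actual patch code) contrasting the plain linear
CDF scan with a scan that visits topics in descending-weight order. In the real
sampler the ordering and the total weight are maintained incrementally across
draws rather than recomputed each time as they are here.

    import scala.util.Random

    object FastTopicSampling {
      // Standard sampling: walk all K topics in index order until the
      // running CDF crosses the uniform draw. O(K) per token.
      def sampleLinear(weights: Array[Double], rng: Random): Int = {
        val total = weights.sum
        val u = rng.nextDouble() * total
        var cum = 0.0
        var k = 0
        while (k < weights.length - 1) {
          cum += weights(k)
          if (cum >= u) return k
          k += 1
        }
        k
      }

      // "Fast" variant: visit topics in descending-weight order, so the
      // running CDF usually crosses u within the first few high-mass topics,
      // giving near-constant expected work per token when the mass is
      // concentrated. Same distribution as the linear scan.
      def sampleSorted(weights: Array[Double], rng: Random): Int = {
        val order = weights.indices.sortBy(i => -weights(i))
        val total = weights.sum
        val u = rng.nextDouble() * total
        var cum = 0.0
        var pos = 0
        while (pos < order.length - 1) {
          cum += weights(order(pos))
          if (cum >= u) return order(pos)
          pos += 1
        }
        order(order.length - 1)
      }
    }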

Here are descriptions of each timed phase (a rough instrumentation sketch follows the list):
Setup: time for setup operations, such as creating the graph
Resample: time used in the Gibbs sampling step to sample new topics
Update: time spent applying the topic deltas to the histogram of each vertex
Global: time spent updating the global histogram
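For reference, the per-phase numbers in the spreadsheet are simple wall-clock
accumulations. A hypothetical helper like the one below (not the code used for
the measurements) shows how such buckets can be collected per Gibbs iteration:

    import scala.collection.mutable

    // Hypothetical per-phase wall-clock accumulator; phase names mirror the
    // buckets described above.
    object PhaseTimer {
      private val totals = mutable.Map[String, Long]().withDefaultValue(0L)

      def timed[T](phase: String)(body: => T): T = {
        val start = System.nanoTime()
        try body finally {
          totals(phase) += System.nanoTime() - start
        }
      }

      def report(): Unit =
        totals.foreach { case (phase, ns) =>
          println(f"$phase%-10s ${ns / 1e9}%.3f s")
        }
    }

    // Usage inside one iteration:
    // PhaseTimer.timed("Resample") { /* sample new topic assignments */ }
    // PhaseTimer.timed("Update")   { /* apply topic deltas to vertex histograms */ }
    // PhaseTimer.timed("Global")   { /* update the global topic histogram */ }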

As can be seen in the data, the resampling time flattens out (stops increasing) 
after ~500 topics, so compute time grows sublinearly, approaching constant, in the 
number of topics, which is exactly what fast (Gibbs sampling) LDA is supposed to do.

All these tests were run locally on my MacBook Pro with 3GB of executor memory. I 
used the data generator with these parameters (a rough sketch of the corresponding 
generative process follows the list):
alpha=beta=.01
nDocs=300
nWords=14036 (number of unique words)
nTokensPerDoc=1000
Number of Tokens (equivalent to nDocs*nTokensPerDoc)=300000
Number of Iterations=10
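For readers who have not looked at the generator, a configuration like the above
corresponds to the standard LDA generative process. The sketch below is my own
illustration of that process using Breeze's Dirichlet and Multinomial
distributions (Breeze is an MLlib dependency), not the actual generator used for
this test; depending on the Breeze version an implicit RandBasis may need to be
in scope.

    import breeze.linalg.DenseVector
    import breeze.stats.distributions.{Dirichlet, Multinomial}

    object SyntheticLdaCorpus {
      // Returns one Array[Int] of word ids per document.
      def generate(nDocs: Int, nWords: Int, nTokensPerDoc: Int, nTopics: Int,
                   alpha: Double, beta: Double): Seq[Array[Int]] = {
        // Per-topic word distributions: phi_k ~ Dirichlet(beta)
        val phi = Array.fill(nTopics)(Dirichlet(DenseVector.fill(nWords)(beta)).draw())
        (0 until nDocs).map { _ =>
          // Per-document topic mixture: theta_d ~ Dirichlet(alpha)
          val theta = Dirichlet(DenseVector.fill(nTopics)(alpha)).draw()
          Array.fill(nTokensPerDoc) {
            val z = Multinomial(theta).draw()  // topic assignment for this token
            Multinomial(phi(z)).draw()         // word id drawn from that topic
          }
        }
      }
    }

    // e.g. SyntheticLdaCorpus.generate(nDocs = 300, nWords = 14036,
    //   nTokensPerDoc = 1000, nTopics = 500, alpha = 0.01, beta = 0.01)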

Since this was a fairly successful test, my next task is running on an EC2 
cluster. To make it easy to compare with Guoqiang's testing, I will use a 
similar 4-node cluster and change the data generator parameters to create a 
similar data set, but run for fewer iterations (probably 10x fewer, so 15). My 
goal is to get that done over the weekend. If that goes well, I will then start 
refactoring to match the API proposed by Joseph.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Different from the current machine learning algorithms 
> in MLlib, instead of using optimization algorithms such as gradient descent, 
> LDA uses inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.


