[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272184#comment-14272184 ]
Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

Sounds good Joseph. I have some good news: I finished initial testing of runtime vs. number of topics this morning. You can find the raw numbers below. The top set is from the fast LDA sampler from the paper above; the bottom set uses ordinary sampling. They produce equivalent results, but the fast version exploits the fact that the majority of the "mass" of the CDF is concentrated in only a few topics, so it is smart to check those first.

https://docs.google.com/spreadsheets/d/1RZLsnfLL2XmKWNJ6kPM_KaiDkTfQOYdXdG5EywlY3Os/edit?usp=sharing

Here are descriptions of each time:
- Setup: time for setup operations, such as creating the graph
- Resample: time used in the Gibbs sampling step to sample new topics
- Update: time spent applying the topic deltas to the histogram of each vertex
- Global: time spent updating the global histogram

As can be seen in the data, the resampling time flattens (stops increasing) after ~500 topics, achieving sublinear, near-constant compute time in the number of topics, which is exactly what fast (Gibbs sampling) LDA is supposed to do.

All these tests were run locally on my MacBook Pro with 3 GB of executor memory. I used the data generator with these parameters:
- alpha = beta = 0.01
- nDocs = 300
- nWords = 14036 (number of unique words)
- nTokensPerDoc = 1000
- number of tokens (equivalent to nDocs * nTokensPerDoc) = 300000
- number of iterations = 10

Since this was a fairly successful test, my next task is running on an EC2 cluster. To make it easy to compare against Guoqiang's testing, I will use a similar 4-node cluster and change the data generator parameters to create a similar data set, but run for fewer iterations (probably 10x fewer, so 15). My goal is to get that done over the weekend. If that goes well, then I will start refactoring to match the API proposed by Joseph.
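To illustrate the idea behind the fast sampler (this is only a sketch of the CDF-scan trick, not the actual SPARK-1405 code): if topics are visited in descending order of mass, a draw usually terminates after checking the first few high-mass topics instead of scanning all K entries of the CDF.

```python
import random

def sample_topic_fast(topic_mass, rng=random):
    """Sample a topic index proportional to its (unnormalized) mass.

    Illustrative sketch only: high-mass topics are checked first, so
    when most of the CDF's mass sits in a few topics the scan stops
    after a handful of comparisons rather than O(K).
    """
    # Visit topics in descending order of mass.
    order = sorted(range(len(topic_mass)), key=lambda k: -topic_mass[k])
    total = sum(topic_mass)
    u = rng.random() * total  # uniform draw over the total mass
    cum = 0.0
    for k in order:
        cum += topic_mass[k]
        if u < cum:
            return k
    return order[-1]  # guard against floating-point round-off
```

With a skewed mass vector, most draws terminate after the first one or two comparisons, which is consistent with the flattening resample times in the spreadsheet above.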
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Different from the current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
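For reference, the quoted issue describes LDA trained by Gibbs sampling. The standard collapsed-Gibbs conditional for resampling one token's topic (a textbook sketch, not the code in this patch) is p(z = k | rest) ∝ (n_dk + alpha) * (n_wk + beta) / (n_k + V * beta):

```python
def gibbs_conditional(n_dk, n_wk, n_k, alpha, beta, vocab_size):
    """Unnormalized collapsed-Gibbs conditional p(z = k | rest) for one token.

    Sketch of the standard LDA update (not the SPARK-1405 implementation):
    n_dk[k] = tokens in this document assigned to topic k (current token excluded),
    n_wk[k] = corpus-wide count of this word assigned to topic k,
    n_k[k]  = total tokens assigned to topic k.
    """
    return [
        (n_dk[k] + alpha) * (n_wk[k] + beta) / (n_k[k] + vocab_size * beta)
        for k in range(len(n_dk))
    ]
```

The fast sampler discussed in the comment above builds the same conditional; it only changes the order in which the resulting mass vector is scanned when drawing the new topic.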