[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278381#comment-14278381
 ] 

Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

I worked on getting some preliminary benchmark results, but ran into a snag 
that greatly limited what I could test.

I used 4 EC2 r3.2xlarge instances, for a total of 32 executors with 215GB of 
memory, so fairly close to the setup used in [~gq]'s tests.

I have been using the data generator I wrote, but had not previously stress 
tested it with a large number of topics, and it fails somewhere between 100 
and 1,000 topics. I made some optimizations with mapPartitions, but still run 
into GC overhead problems. I am unsure whether this is my fault or whether 
breeze is generating lots of garbage somewhere when sampling from the 
Dirichlet/multinomial distributions. Extra eyes would be great, to check 
whether it looks reasonable from a GC perspective: 
https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala
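In case it helps frame the GC question, here is a minimal sketch (plain Scala, no breeze; the object and method names are hypothetical, not the actual LDADataGenerator code) of the allocation pattern I would expect to be GC-friendly inside mapPartitions: one RNG per partition and an inverse-CDF multinomial draw that allocates nothing per sample, rather than constructing a new distribution object per draw.

```scala
import scala.util.Random

// Hypothetical sketch of allocation-light multinomial sampling for use
// inside mapPartitions. Names are illustrative only.
object LowGarbageSampling {

  // Draw an index from an unnormalized multinomial by walking the CDF,
  // allocating nothing per call (vs. building a sampler object per draw,
  // which creates garbage under heavy use).
  def sampleMultinomial(weights: Array[Double], rng: Random): Int = {
    var total = 0.0
    var i = 0
    while (i < weights.length) { total += weights(i); i += 1 }
    var u = rng.nextDouble() * total
    i = 0
    while (i < weights.length - 1 && u >= weights(i)) {
      u -= weights(i)
      i += 1
    }
    i
  }

  // The per-partition pattern: create the RNG once, then reuse it for
  // every document in the partition; the only per-document allocation
  // is the output token array itself.
  def generateTokens(docWordWeights: Iterator[Array[Double]],
                     tokensPerDoc: Int,
                     seed: Long): Iterator[Array[Int]] = {
    val rng = new Random(seed) // one RNG per partition, not per document
    docWordWeights.map { weights =>
      Array.fill(tokensPerDoc)(sampleMultinomial(weights, rng))
    }
  }
}
```

In an actual generator this would be wired up as something like `rdd.mapPartitionsWithIndex { (pid, it) => generateTokens(it, n, seed + pid) }`, so each partition gets a distinct seed; the point is just that nothing besides the output arrays is allocated per document.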

Alternatively, [~gq], would it be possible to get the data set you used for 
testing? Or are you using a different dataset for testing, [~josephkb]? It 
would be useful to have a common way to compare implementation performance.

Here are the test results I did get:
Docs: 250,000
Words: 75,000
Tokens: 30,000,000
Iterations: 15
Topics: 100
Alpha = Beta = 0.01

Setup Time: 20s
Resampling Time: 80s
Updating Counts Time: 53s
Global Counts Time: 4s
Total Time: 170s (2.83m)

This looks like a good improvement over the original numbers (red/blue plot): 
extrapolating for the 10x fewer iterations this run used (2.83m x 10), it 
would be about 28m vs ~45m. I am fairly confident this time will scale well 
with the number of topics, based on the previous test results I posted. I 
would be more than happy to run more benchmark tests if I can get access to 
the data set used for the other tests, or to whatever Joseph is using to test 
his PR. I am also going to start refactoring into Joseph's API and will open a 
PR once that is done, probably later this week. It would be great to have both 
Gibbs and EM in the next release.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which rely on optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
