[ https://issues.apache.org/jira/browse/SPARK-5560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704224#comment-14704224 ]

Joseph K. Bradley commented on SPARK-5560:
------------------------------------------

I'm doing some testing on this right now using 8 r3.2xlarge instances and Spark 
1.4, and it actually does scale to many more iterations.  There must have been 
fixes in Spark core since 1.3 that resolved the underlying problem.  I've run 30 
iterations without incident; each iteration took about 45 seconds, except the 
first, which took about 4.5 minutes.
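
For reference, here is a minimal spark-shell-style sketch of the kind of run 
described above against the Spark 1.4 MLlib API; the checkpoint directory and 
corpus path are placeholders, not the actual setup from my test, and the corpus 
loading step is elided.

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// A reliable checkpoint directory (e.g. on HDFS) keeps the RDD lineage short,
// which is what lets EM LDA survive many iterations; this path is a placeholder.
sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")

// (docId, term-count vector) pairs; how the corpus is prepared is not shown here.
val documents: RDD[(Long, Vector)] =
  sc.objectFile[(Long, Vector)]("hdfs:///path/to/wikipedia-corpus")

val model = new LDA()
  .setK(100)                  // 100 topics, as in the test above
  .setMaxIterations(30)       // 30 iterations ran without incident on 1.4
  .setCheckpointInterval(10)  // checkpoint every 10 iterations (the default)
  .run(documents)             // EM is the default optimizer
  .asInstanceOf[DistributedLDAModel]

The checkpoint directory matters: without one configured, the graph lineage 
grows with each EM iteration, which is the kind of thing that produces the disk 
and timeout failures described in the issue below.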

I'm going to close this issue.

> LDA EM should scale to more iterations
> --------------------------------------
>
>                 Key: SPARK-5560
>                 URL: https://issues.apache.org/jira/browse/SPARK-5560
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>             Fix For: 1.4.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (LDA) sometimes fails to run for many iterations 
> on large datasets, even when it is able to run for a few iterations.  It 
> should be able to run for as many iterations as the user likes, with proper 
> persistence and checkpointing.
> Here is an example from a test on 16 workers (EC2 r3.2xlarge) on a big 
> Wikipedia dataset:
> * 100 topics
> * Training set size: 4072243 documents
> * Vocabulary size: 9869422 terms
> * Training set token count: 1041734290
> It runs for about 10-15 iterations before failing, even with a variety of 
> checkpointInterval values and longer timeout settings (up to 5 minutes).  
> The failures range from worker/driver disconnections to workers running out 
> of disk space.  Based on rough calculations, I would not expect workers to 
> run out of memory or disk space.  There was some job imbalance, but not a 
> significant amount.
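
As a footnote on the "longer timeout settings" mentioned in the issue, here is a 
sketch of the kind of configuration knobs involved; the values are illustrative, 
not the ones used in the original test.

import org.apache.spark.SparkConf

// Illustrative only: longer network timeouts (values in seconds) alongside a
// smaller checkpoint interval were the kinds of settings being varied.
val conf = new SparkConf()
  .setAppName("LDA-EM-wikipedia")
  .set("spark.network.timeout", "300")                   // ~5 minutes
  .set("spark.core.connection.ack.wait.timeout", "300")  // connection ack timeout

// checkpointInterval itself is set on the LDA instance,
// e.g. new LDA().setCheckpointInterval(5)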



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
