[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387085#comment-14387085 ]

Joseph K. Bradley commented on SPARK-5564:
------------------------------------------

[~debasish83] I mainly used a Wikipedia dataset.  Here's an S3 bucket 
(requester pays) which [~sparks] created: 
[s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big 
Wikipedia dataset.  I'm not sure it's the same one I used, but it should be 
qualitatively similar.  Mine had ~1.1 billion tokens, with about 1 million 
documents and a vocabulary of about 1 million terms.

As far as scaling goes, the EM code scaled linearly with the number of topics 
K.  Communication was the bottleneck for sizable datasets, and it also scales 
linearly with K.  The largest K I've run on that dataset was K=100, using 
16 r3.2xlarge workers; a sketch of that kind of run follows.
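
To make the setup concrete, here is a minimal sketch of that kind of run, 
assuming the MLlib EM-based LDA API; the iteration count is illustrative, not 
the exact configuration I used:

{code:scala}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: (document ID, term-count vector) pairs -- e.g. ~1M documents over a
// ~1M-term vocabulary for the Wikipedia dataset mentioned above.
def trainEmLda(corpus: RDD[(Long, Vector)]): DistributedLDAModel = {
  new LDA()
    .setK(100)             // largest K tried on that dataset
    .setMaxIterations(100) // illustrative; tune for the corpus
    .run(corpus)           // EM is the default optimizer
    .asInstanceOf[DistributedLDAModel]
}
{code}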

> Support sparse LDA solutions
> ----------------------------
>
>                 Key: SPARK-5564
>                 URL: https://issues.apache.org/jira/browse/SPARK-5564
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
> concentration parameters be > 1.0.  It should support values > 0.0, which 
> should encourage sparser topics (phi) and document-topic distributions 
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov 
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization." 2014.
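
For context, here is roughly where those parameters sit in the current API; 
this is a sketch with illustrative values, not the proposed change.  The reason 
the projection is needed: in the EM M-step the MAP update is roughly 
phi_wk proportional to (n_wk + beta - 1), which can go negative once beta < 1, 
so negative entries must be projected to zero and the rest renormalized.

{code:scala}
import org.apache.spark.mllib.clustering.LDA

// Illustrative values only.  The current EM implementation requires both
// concentrations to be > 1.0; this ticket would allow any value > 0.0
// (e.g. 0.1), encouraging sparser theta (doc-topic) and phi (topic-term).
val lda = new LDA()
  .setK(20)                    // number of topics
  .setDocConcentration(1.1)    // alpha: Dirichlet prior on theta
  .setTopicConcentration(1.1)  // beta: Dirichlet prior on phi
{code}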


