[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387085#comment-14387085 ]
Joseph K. Bradley commented on SPARK-5564:
------------------------------------------

[~debasish83] I mainly used a Wikipedia dataset. Here's an S3 bucket (requester pays) which [~sparks] created: [s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big Wikipedia dataset. I'm not sure if it's the same one I used, but it should be qualitatively similar. Mine had ~1.1 billion tokens, with about 1 million documents and 1 million terms (vocab size).

As for scaling, the EM code scaled linearly with the number of topics K. Communication was the bottleneck for sizable datasets, and it also scales linearly with K. The largest K I've run with on that dataset was K=100; that was using 16 r3.2xlarge workers.

> Support sparse LDA solutions
> ----------------------------
>
>                 Key: SPARK-5564
>                 URL: https://issues.apache.org/jira/browse/SPARK-5564
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors'
> concentration parameters be > 1.0. It should support values > 0.0, which
> should encourage sparser topics (phi) and document-topic distributions
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive
> Regularization for Stochastic Matrix Factorization." 2014.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
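To illustrate the projected M-step the issue description refers to, here is a minimal NumPy sketch (not Spark/MLlib code; the function name and shapes are hypothetical). With a topic-concentration parameter beta in (0, 1), the MAP numerator n_wk + beta - 1 can go negative, so negative entries are clipped to zero before renormalizing each topic column — which is what produces sparse topics (phi):

```python
import numpy as np

def m_step_projected(counts, beta):
    """Hypothetical projected M-step for sparse LDA.

    counts: (vocabSize, k) expected word-topic counts from the E-step.
    beta:   topic concentration in (0, 1]; beta < 1 encourages sparsity.
    Returns a column-normalized topic matrix phi of the same shape.
    """
    raw = counts + (beta - 1.0)        # MAP numerator; may be negative
    raw = np.maximum(raw, 0.0)         # projection: clip negatives to zero
    col_sums = raw.sum(axis=0)
    col_sums[col_sums == 0.0] = 1.0    # guard against an all-zero topic
    return raw / col_sums              # normalize each topic over the vocab

# Toy example: with beta = 0.5, small counts get zeroed out entirely.
counts = np.array([[5.0, 0.2],
                   [3.0, 0.1],
                   [0.5, 6.0]])
phi = m_step_projected(counts, beta=0.5)
```

In the toy run above, topic 1's small counts (0.2 and 0.1) fall below 1 - beta and are projected to exactly zero, leaving all of its mass on the third term, while topic 0 keeps its two large entries. With beta > 1 the clip never fires and this reduces to the usual smoothed M-step.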