[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123803#comment-16123803 ]

Valeriy Avanesov commented on SPARK-5564:
-----------------------------------------

I am considering working on this issue. The question is whether there should be a new EMLDAOptimizerVorontsov, or whether the existing EMLDAOptimizer should be rewritten.

> Support sparse LDA solutions
> ----------------------------
>
>          Key: SPARK-5564
>          URL: https://issues.apache.org/jira/browse/SPARK-5564
>      Project: Spark
>   Issue Type: Improvement
>   Components: MLlib
> Affects Versions: 1.3.0
>     Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors'
> concentration parameters be > 1.0. It should support values > 0.0, which
> should encourage sparser topics (phi) and document-topic distributions
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive
> Regularization for Stochastic Matrix Factorization." 2014.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
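The projection the ticket describes can be sketched concretely. Below is an illustrative Python sketch (not Spark's implementation; the function name is made up) of the zero-clipping MAP-style M-step update from Vorontsov and Potapenko: with a Dirichlet concentration in (0, 1), subtracting one from the shifted counts and clipping at zero drives small entries of the topic distribution to exactly zero, which is the sparsity being requested.

```python
import numpy as np

def sparse_m_step(counts, alpha):
    """Illustrative M-step update for one topic's word distribution.

    counts: expected word counts n_wk for topic k (from the E-step).
    alpha:  Dirichlet concentration; values in (0, 1) push entries to
            exactly zero, encouraging sparse phi/theta.
    """
    # MAP update: phi_w is proportional to max(n_wk + alpha - 1, 0).
    # The max(., 0) is the projection added to the M-step.
    unnormalized = np.maximum(counts + alpha - 1.0, 0.0)
    total = unnormalized.sum()
    if total == 0.0:
        # Degenerate case: all entries clipped; fall back to uniform.
        return np.full_like(counts, 1.0 / len(counts))
    return unnormalized / total

phi = sparse_m_step(np.array([5.0, 0.3, 0.1, 2.0]), alpha=0.5)
# Entries with small expected counts (0.3 and 0.1) are clipped to exactly
# zero; the remaining mass is renormalized to sum to one.
```

With alpha >= 1 the same update never clips, which matches the current requirement that the concentration parameters exceed 1.0.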
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632712#comment-14632712 ]

Apache Spark commented on SPARK-5564:
--------------------------------------

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7507
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632718#comment-14632718 ]

Feynman Liang commented on SPARK-5564:
---------------------------------------

Sorry, tagged the wrong JIRA. Ignore the above PR; I'm not currently working on this.
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632719#comment-14632719 ]

Feynman Liang commented on SPARK-5564:
---------------------------------------

[~josephkb] can you reset this status to open?
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388128#comment-14388128 ]

Debasish Das commented on SPARK-5564:
--------------------------------------

[~sparks] we are trying to access the dataset from EC2, but it gives an error:

[ec2-user@ip-172-31-38-56 ~]$ aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/
A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied

Could you please check whether it is still available for use?
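One likely cause of the AccessDenied above (an assumption, though the bucket name itself suggests it is configured as requester-pays) is that the AWS CLI must be told explicitly that the caller accepts the data-transfer charges:

```shell
# Requester-pays buckets deny plain requests; pass --request-payer so the
# calling account agrees to be billed for the transfer.
aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/ --request-payer requester
```

The command still requires valid AWS credentials for an account that can be billed; without the flag, S3 returns AccessDenied even to authenticated callers.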
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387085#comment-14387085 ]

Joseph K. Bradley commented on SPARK-5564:
-------------------------------------------

[~debasish83] I mainly used a Wikipedia dataset. Here's an S3 bucket (requester pays) which [~sparks] created: [s3://files.sparks.requester.pays/enwiki_category_text/], which holds a big Wikipedia dataset. I'm not sure if it's the same one I used, but it should be qualitatively similar. Mine had ~1.1 billion tokens, with about 1 million documents and 1 million terms (vocab size).

As far as scaling goes, the EM code scaled linearly with the number of topics K. Communication was the bottleneck for sizable datasets, and it scales linearly with K. The largest K I've run with on that dataset was K=100; that was using 16 r3.2xlarge workers.
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387180#comment-14387180 ]

Debasish Das commented on SPARK-5564:
--------------------------------------

Cool... I will run my experiments on the same dataset as well and report results.
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387232#comment-14387232 ]

Joseph K. Bradley commented on SPARK-5564:
-------------------------------------------

No, I didn't use sparsity at all, so yours will be the first experiments looking at reducing communication via sparsity (AFAIK).
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386049#comment-14386049 ]

Debasish Das commented on SPARK-5564:
--------------------------------------

[~josephkb] could you please point me to the datasets used for benchmarking? I have started testing log-likelihood loss for recommendation, and since I have already added the constraints, this is the right time to test it on LDA benchmarks as well. I will open up the code as part of https://issues.apache.org/jira/browse/SPARK-6323 as soon as our legal clears it.

I am looking into LDA test cases, but since I am optimizing log-likelihood directly, I am looking to add more test cases from your LDA JIRA. For recommendation, I know how to construct the test cases.
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343476#comment-14343476 ]

Joseph K. Bradley commented on SPARK-5564:
-------------------------------------------

It would be interesting to see comparisons between the two, but I don't have a good sense of which would be more efficient.

{quote}
I am assuming here that LDA architecture is a bipartite graph with nodes as docs/words and there are counts on each edge
{quote}

You're correct.
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342311#comment-14342311 ]

Debasish Das commented on SPARK-5564:
--------------------------------------

I am currently using the following PR to do large-rank matrix factorization with various constraints. I am not sure the current ALS will scale to large ranks, but I am keen to compare the exact formulation in the GraphX-based LDA flow:
https://github.com/scalanlp/breeze/pull/364

The idea here is to solve the constrained factorization problem as explained in Vorontsov and Potapenko:

minimize f(w, h*) s.t. 1'w = 1, w >= 0 (row constraints)
minimize f(w*, h) s.t. 0 <= h <= 1, normalize each column in h

Here I want f(w, h) to be the MAP loss, but I already solved the least-squares variant in https://issues.apache.org/jira/browse/SPARK-2426 and got a good improvement in MAP statistics. Here I also expect perplexity to improve.

If no one else is looking into it, I would like to compare the join-based factorization flow (ml.recommendation.ALS) with the GraphX-based LDA flow. In fact, if you think the LDA-based flow will be more efficient than the join-based factorization flow for large ranks, I can implement stochastic matrix factorization directly on top of LDA and add both the least-squares and MAP losses.
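The row constraint in the first subproblem above (1'w = 1, w >= 0) amounts to a Euclidean projection onto the probability simplex. A standard sorting-based sketch of that projection is below; this is only an illustration following the well-known O(n log n) algorithm (Duchi et al. style), not the code in the breeze PR:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.

    Sort descending, find the largest prefix whose entries stay positive
    after subtracting a common shift theta, then clip the rest to zero.
    """
    u = np.sort(v)[::-1]                      # sorted descending
    cssv = np.cumsum(u) - 1.0                 # cumulative sums minus target
    ind = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - cssv / ind > 0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)           # common shift
    return np.maximum(v - theta, 0.0)

w = project_to_simplex(np.array([0.8, 0.6, -0.2]))
# w = [0.6, 0.4, 0.0]: nonnegative, sums to one, smallest entry zeroed.
```

The box constraint in the second subproblem (0 <= h <= 1 followed by column normalization) is simpler: an elementwise clip with np.clip, then dividing each column by its sum.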
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342312#comment-14342312 ]

Debasish Das commented on SPARK-5564:
--------------------------------------

By the way, the following step is an approximation to the real constraint, but if we get good results over Gibbs-sampling-based approaches, there are ways to solve the real problem as well:

minimize f(w*, h) s.t. 0 <= h <= 1, normalize each column in h