[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14738090#comment-14738090 ] ding commented on SPARK-5556: - We have made the spark package and it can be find here http://spark-packages.org/package/intel-analytics/TopicModeling > Latent Dirichlet Allocation (LDA) using Gibbs sampler > -- > > Key: SPARK-5556 > URL: https://issues.apache.org/jira/browse/SPARK-5556 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Pedro Rodriguez > Attachments: LDA_test.xlsx, spark-summit.pptx > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704885#comment-14704885 ] Pedro Rodriguez commented on SPARK-5556: That is awesome. I've been a bit busy moving/starting a PhD at CU Boulder so sorry for not responding sooner. The good news is that I will very likely be working on LDA related research :) I haven't made more progress on the LDA package beyond creating the github repo and spark package (on the packages website). I will take a look at the repo this week. It would be great if what you have worked on can be made a package. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705810#comment-14705810 ] Jason Dai commented on SPARK-5556: -- [~pedrorodriguez] We'll try to make a spark package based on our repo; please help take a look at the code and provide your feedback. Please let us know if there are anything we may collaborate for LDA/topic modeling on Spark. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696677#comment-14696677 ] ding commented on SPARK-5556: - The code can be found https://github.com/intel-analytics/TopicModeling. There is an example in the package, you can try gibbs sampling lda or online lda by setting --optimizer as gibbs or online Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693579#comment-14693579 ] Jason Dai commented on SPARK-5556: -- Sure; we can share our code on github, and then try to make a Spark package :-) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692480#comment-14692480 ] Joseph K. Bradley commented on SPARK-5556: -- Wow that sounds awesome. Would you be able to share your code as a Spark package? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692469#comment-14692469 ] Jason Dai commented on SPARK-5556: -- [~pedrorodriguez] I wonder if you have made any progress on the LDA package. We have actually built a package of topic modeling algorithms for our use cases, which contains Gibbs Sampling LDA (adapted to the MLlib LDA interface based on the PRs/codes from [~pedrorodriguez] and [~gq], including AliasLDA, SparseLDA, LightLDA and FastLDA algorithms), as well as Hierarchical LDA (implemented by [~yuhaoyan] from our team). We can also share that package for people to try different algorithms. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615616#comment-14615616 ] Pedro Rodriguez commented on SPARK-5556: I am still interested, but was unsure of the status of other implementations. Given not much new, perhaps I should go ahead with it? Last week I was also considering the possibility of making a Spark package for LDA. The aims would be threefold: have more algorithms (I have been contact by a couple researchers basing their work on the Gibbs LDA I worked on, plus I will likely be using it in my own PhD starting this fall), a good place for relatively new/unproven variants, and then pull the best into spark. I have been pretty busy so haven't gotten around to that, but it has been on my mind. When I get time this week, I will take a look at the current source and what I have to see how much work it would take to get to something that could make a pull request. Although FastLDA/LightLDA might be algorithmically better, I think that what I have would be a good starting place at least. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615610#comment-14615610 ] Joseph K. Bradley commented on SPARK-5556: -- [~pedrorodriguez] I'm going to remove the target version of this JIRA for now, but please do let me know if you're interested in picking this up again. We should definitely add Gibbs sampling in eventually. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615695#comment-14615695 ] Joseph K. Bradley commented on SPARK-5556: -- Right, I don't think there has been much change, though [~gq] could perhaps say more. I think starting as a Spark package would be great; as you say, it would allow testing of different variants. A reasonable timeline might be to add a Spark package around the 1.5 release and port the best version to Spark for the 1.5 release. Please ping if you'd like feedback. Thank you! Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530029#comment-14530029 ] Guoqiang Li commented on SPARK-5556: [FastLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553]: Gibbs sampling,The computational complexity is O(n_dk), n_dk is the number of topic (unique) in document d. I recommend to be used for short text [LightLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L763] Metropolis Hasting sampling The computational complexity is O(1)(It depends on the partition strategy and takes up more memory). Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1452#comment-1452 ] Pedro Rodriguez commented on SPARK-5556: What are thoughts on implementation?It looks like LightLDA converges faster and takes more memory, but FastLDA is slightly faster. Could you give a good summary of comparing the different algorithms [~gq]? I went through the data and plots, but some interpretation would be great. How should we move forward on choosing an implementation? It makes more sense to decide on something, then work on merging that choice rather than preparing multiple choices. On my Gibbs implementation, I am working on the assumption that algorithmically it is the same as [~gq]'s and should perform comparably. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518602#comment-14518602 ] Guoqiang Li commented on SPARK-5556: I put the latest LDA code in [Zen|https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering] The test results [here|https://issues.apache.org/jira/secure/attachment/12729030/LDA_test.xlsx] (72 cores, 216G ram, 6 servers, Gigabit Ethernet) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518378#comment-14518378 ] Pedro Rodriguez commented on SPARK-5556: I will start working on it again then. It would be great for that research project to result in Gibbs being added. The refactoring ended up roadblocking that quite a bit. I think [~gq] was working on something called LightLDA. I don't know the specifics of the algorithm, but the sampler scales theoretically O(1) with topics. My implementation has something which in the testing I did looks like in practice it is O(1) or very near it. To get Gibbs merged in (or as a candidate implementation), how does this look: 1. Refactor code to fit the PR that you just merged 2. Use the testing harness you used for the EM LDA to test with the same conditions. This should be fairly easy since you already did all the work to get things pipelining correctly. 3. If it scales well, then merge or consider other applications 4. Code review somewhere in there. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518601#comment-14518601 ] Pedro Rodriguez commented on SPARK-5556: [~gq] is the LDAGibbs line what I implemented or something else? In any case, the optimization on sampling shouldn't change the results, so it looks like LightLDA converges to a better perplexity. Do you have any performance graphs? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518400#comment-14518400 ] Joseph K. Bradley commented on SPARK-5556: -- That plan sounds good. I haven't yet been able to look into LightLDA, but it would be good to understand if it's (a) a modification which could be added to Gibbs later on or (b) an algorithm which belongs as a separate algorithm. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518618#comment-14518618 ] Guoqiang Li commented on SPARK-5556: LDA_Gibbs combines the advantages of AliasLDA, FastLDA and SparseLDA algorithm. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553. Yes LightLDA converge faster,but it takes up more memory Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518621#comment-14518621 ] Guoqiang Li commented on SPARK-5556: [spark-summit.pptx|https://issues.apache.org/jira/secure/attachment/12729035/spark-summit.pptx] has introduced the relevant algorithm Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518141#comment-14518141 ] Joseph K. Bradley commented on SPARK-5556: -- Great! I'm not aware of blockers. As far as other active implementations, the only ones I know of are those reference by [~gq] above. Please do ping him on your work and see if there are ideas which can be merged. We can help with the coordination and discussions as well. Thanks! Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518133#comment-14518133 ] Pedro Rodriguez commented on SPARK-5556: With the refactoring done, I can get to working on getting the core code running on that interface. Does it seem likely if that is completed, gibbs will get merged for 1.5. Are there any foreseeable blockers or potential different implementations that are being considered? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516072#comment-14516072 ] Joseph K. Bradley commented on SPARK-5556: -- I updated the target version for this. I hope we can get it in for the next release! Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339847#comment-14339847 ] Guoqiang Li commented on SPARK-5556: [This branch|https://github.com/witgo/spark/tree/lda_Gibbs]'s computational complexity is O(Ndk), is the number of topic (unique) in document d Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339772#comment-14339772 ] Pedro Rodriguez commented on SPARK-5556: See PR for info, TLDR: contains refactoring for multiple LDA algorithms, including how EM would be refactored. Will in the near future contain Gibbs implementation I have/had been working on. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339767#comment-14339767 ] Apache Spark commented on SPARK-5556: - User 'EntilZha' has created a pull request for this issue: https://github.com/apache/spark/pull/4807 Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14339849#comment-14339849 ] Pedro Rodriguez commented on SPARK-5556: Based on initial testing, I recall FastLDA in practice being O(1), should be able to confirm that at a larger scale test soon. LightLDA definitely worth looking into I think, at this point though my focus is on getting the FastLDA Gibbs to a mergable state (tests pass, refactoring/api for LDA is good, and performs at scale as good as or better than EM). Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308615#comment-14308615 ] Guoqiang Li commented on SPARK-5556: LightLDA's computational complexity is O(1) The paper: http://arxiv.org/abs/1412.1576 The code(work in progress): https://github.com/witgo/spark/tree/LightLDA Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308603#comment-14308603 ] Pedro Rodriguez commented on SPARK-5556: Posting here as a status update. I will be working on and opening a pull request for adding a collapsed Gibbs sampling version which uses FastLDA for super linear scaling with number of topics. Below is the design document (same as from the original LDA JIRA issue), along with the repository/branch I am working on. https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing https://github.com/EntilZha/spark/tree/LDA-Refactor Tasks * Rebase from the merged implementation, refactor appropriately * Merge/implement the required inheritance/trait/abstract classes to support two implementations (EM and Gibbs) using only the entry points exposed in the EM version, plus an optional argument to select between EM/Gibbs. * Do performance tests comparable to those run for EM LDA. Some details for inheritance/trait/abstract: General idea would be to create an API which LDA implementations must satisfy using a trait/abstract class. All implementation details would be encapsulated within a state object satisfying the trait/abstract class. LDA would be responsible for creating an EM or Gibbs state object based on a user argument switch/flag. Linked below is a sample implementation based on an earlier version of the merged EM code (which needs to be updated to reflect the changes since then, but it should show the idea well enough): https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242 Timeline: I have been busier than expected, but rebase/refactoring should be done in the next few days, then I will open a PR to get feedback while running performance tests. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308619#comment-14308619 ] Pedro Rodriguez commented on SPARK-5556: I will read that paper, seems interesting. Probably worth discussing at some point, how is the philosophy behind supporting different algorithms? It seems like there are a good number (at least 2 Gibbs, 1 EM right now). On the same line of thought, perhaps it would be better to open two pull requests, one which refactors the current LDA to allow multiple algorithms, and a second for the Gibbs itself? Thoughts? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14303737#comment-14303737 ] Joseph K. Bradley commented on SPARK-5556: -- I believe [~mengxr] and [~witgo] have confirmed that's the plan. It would be great to get this as another algorithm option; I suspect it will perform better than EM. One thought: There are several possibilities for Gibbs sampling algorithms: * Collapsed Gibbs sampling (most common, but distributed implementations are all non-ergodic) (This is what [~witgo]'s PR uses, I believe.) * Non-collapsed Gibbs sampling (not common, but easy to make an ergodic distributed implementation) * Modified versions of the LDA model designed for distributed computation, such as HD-LDA in Newman et al. “Distributed Algorithms for Topic Models.” JMLR, 2009. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14303074#comment-14303074 ] Sean Owen commented on SPARK-5556: -- To clarify, since I had to double-check too, the LDA implemented in SPARK-1405 uses EM, so this tracks adding an implementation based on Gibbs sampling using that same code base? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org