[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308603#comment-14308603
 ] 

Pedro Rodriguez commented on SPARK-5556:
----------------------------------------

Posting here as a status update. I am working on a collapsed Gibbs sampling 
version that uses FastLDA for sublinear scaling in the number of topics, and I 
will open a pull request for it. Below are the design document (the same one 
as in the original LDA JIRA issue) and the repository/branch I am working on.
https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing

https://github.com/EntilZha/spark/tree/LDA-Refactor

Tasks
* Rebase from the merged implementation, refactor appropriately
* Merge/implement the required inheritance/trait/abstract classes to support 
two implementations (EM and Gibbs) using only the entry points exposed in the 
EM version, plus an optional argument to select between EM/Gibbs.
* Run performance tests comparable to those done for EM LDA.

Some details on the inheritance/trait/abstract class design:
The general idea is to define, as a trait or abstract class, an API that every 
LDA implementation must satisfy. All implementation details would be 
encapsulated in a state object that satisfies that trait/abstract class. LDA 
would be responsible for creating either an EM or a Gibbs state object based 
on a user-supplied switch/flag. Linked below is a sample implementation based 
on an earlier version of the merged EM code (it needs to be updated to reflect 
the changes since then, but it should show the idea well enough):
https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242
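
To make the idea concrete without opening the branch, here is a minimal, 
self-contained sketch of the pattern described above. It is not the code in 
the linked file and not an MLlib API: the names LearningState, EMState, 
GibbsState and LDASketch are hypothetical, the "algorithm" string flag is an 
assumed form of the user switch, and the per-iteration work is a placeholder 
that only advances a counter.

```scala
// Illustrative sketch only. All names here are hypothetical and are not taken
// from the linked branch or from MLlib.

// The contract every LDA implementation would have to satisfy.
trait LearningState {
  def algorithm: String      // which implementation this state belongs to
  def iteration: Int         // number of iterations completed so far
  def next(): LearningState  // run one iteration and return the updated state
}

// EM-backed state. The real EM update is replaced by a counter increment.
final case class EMState(iteration: Int = 0) extends LearningState {
  val algorithm: String = "em"
  def next(): LearningState = copy(iteration = iteration + 1)
}

// Gibbs-backed state. The real sampling sweep is replaced by a counter increment.
final case class GibbsState(iteration: Int = 0) extends LearningState {
  val algorithm: String = "gibbs"
  def next(): LearningState = copy(iteration = iteration + 1)
}

// The entry point stays the same for both implementations; only the optional
// flag decides which state object gets created.
class LDASketch(maxIterations: Int, algorithm: String = "em") {
  private def initialState(): LearningState = algorithm.toLowerCase match {
    case "em"    => EMState()
    case "gibbs" => GibbsState()
    case other   => throw new IllegalArgumentException(s"Unknown LDA algorithm: $other")
  }

  // Drives the chosen implementation through the shared trait only.
  def run(): LearningState =
    (0 until maxIterations).foldLeft(initialState())((state, _) => state.next())
}
```

With this shape, new LDASketch(10).run() would use the EM state by default and 
new LDASketch(10, "gibbs").run() would use the Gibbs state, while callers only 
ever see the shared trait.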

Timeline: I have been busier than expected, but rebase/refactoring should be 
done in the next few days, then I will open a PR to get feedback while running 
performance tests.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> ------------------------------------------------------
>
>                 Key: SPARK-5556
>                 URL: https://issues.apache.org/jira/browse/SPARK-5556
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Guoqiang Li
>



