[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308603#comment-14308603 ]
Pedro Rodriguez commented on SPARK-5556:
----------------------------------------

Posting here as a status update. I am working on a collapsed Gibbs sampling version of LDA, which uses FastLDA for super linear scaling with the number of topics, and will open a pull request for it. Below are the design document (the same one as in the original LDA JIRA issue) and the repository/branch I am working on.

https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
https://github.com/EntilZha/spark/tree/LDA-Refactor

Tasks
* Rebase onto the merged implementation and refactor appropriately.
* Merge/implement the inheritance/trait/abstract classes required to support two implementations (EM and Gibbs), using only the entry points exposed in the EM version plus an optional argument to select between EM and Gibbs.
* Run performance tests comparable to those run for EM LDA.

Some details on the inheritance/trait/abstract classes: the general idea is to define an API, as a trait or abstract class, that every LDA implementation must satisfy. All implementation details would be encapsulated in a state object satisfying that trait/abstract class. LDA would be responsible for creating an EM or a Gibbs state object based on a user-supplied switch/flag. Linked below is a sample implementation based on an earlier version of the merged EM code. It needs to be updated to reflect the changes since then, but it should show the idea well enough:

https://github.com/EntilZha/spark/blob/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling/LDA.scala#L216-L242

Timeline: I have been busier than expected, but the rebase/refactoring should be done in the next few days. I will then open a PR to get feedback while running the performance tests.
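To make the trait/state-object idea above concrete, here is a minimal sketch in plain Scala (no Spark dependency). It is not the code from the linked branch: the names `LearningState`, `EMState`, `GibbsState`, `LDASketch`, and the `algorithm` string flag are all hypothetical, and the per-iteration work is replaced by a counter so that only the structure is shown.

```scala
// Hypothetical sketch: none of these names are taken from the Spark code base.

/** Contract that every LDA inference backend would have to satisfy. */
trait LearningState {
  /** Run one iteration and return the updated state. */
  def next(): LearningState
  /** Number of iterations completed so far. */
  def iteration: Int
  /** Name of the backend, used here only for inspection. */
  def algorithm: String
}

/** Placeholder for the EM backend; a real one would hold the model graph. */
final case class EMState(k: Int, iteration: Int = 0) extends LearningState {
  def next(): LearningState = copy(iteration = iteration + 1)
  val algorithm: String = "em"
}

/** Placeholder for the collapsed Gibbs backend; a real one would hold topic counts. */
final case class GibbsState(k: Int, iteration: Int = 0) extends LearningState {
  def next(): LearningState = copy(iteration = iteration + 1)
  val algorithm: String = "gibbs"
}

/** Single entry point; the optional flag selects the backend. */
class LDASketch(k: Int, maxIterations: Int, algorithm: String = "em") {

  // The entry point only knows the trait, never the concrete backend.
  private def initialState(): LearningState = algorithm.toLowerCase match {
    case "em"    => EMState(k)
    case "gibbs" => GibbsState(k)
    case other   => throw new IllegalArgumentException(s"Unknown LDA algorithm: $other")
  }

  def run(): LearningState = {
    var state = initialState()
    while (state.iteration < maxIterations) {
      state = state.next()
    }
    state
  }
}
```

Under these assumptions, `new LDASketch(k = 10, maxIterations = 5, algorithm = "gibbs").run()` would drive the Gibbs placeholder through five iterations, while omitting the flag falls back to EM, so existing callers of the EM entry points would not need to change.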
> Latent Dirichlet Allocation (LDA) using Gibbs sampler
> ------------------------------------------------------
>
>                 Key: SPARK-5556
>                 URL: https://issues.apache.org/jira/browse/SPARK-5556
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Guoqiang Li