[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148529#comment-14148529 ]
Xiangrui Meng commented on SPARK-1405:
--------------------------------------

[~Guoqiang Li] and [~pedrorodriguez], since there are already 4 or 5 implementations of LDA on Spark and [~dlwh] is also interested in one with partial labels, we do need to coordinate to avoid duplicating effort. I think the TODOs are:

0. Post progress updates frequently.
1. Test Joey's implementation and Guoqiang's (both on GraphX) on some common datasets. We also need to verify the correctness of the output by comparing the results with those from single-machine solvers (a baseline sketch follows at the end of this message).
2. Discuss the public APIs in MLlib. Because GraphX is an alpha component, we should not expose GraphX APIs in MLlib. See my previous comments on the input and model types.
3. Have a standard implementation of LDA with Gibbs sampling in MLlib. The target is v1.2, which means it should be merged by the end of November. Improvements can be made in future releases.

Could you share your timelines? Thanks!

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>              Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which rely on optimization algorithms such as gradient descent, LDA uses inference algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word-segmentation step (imported from Lucene), and a Gibbs sampling core.
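
For TODO 1 above, a minimal single-machine collapsed Gibbs sampler can serve as the reference when checking the output of the GraphX-based implementations. The sketch below is only an illustration of that baseline, not any of the implementations mentioned above; the object name LdaGibbsSketch, the parameter defaults, and the plain-array corpus representation are placeholder assumptions.

{code:scala}
import scala.util.Random

// A minimal single-machine collapsed Gibbs sampler for LDA (illustrative sketch only).
object LdaGibbsSketch {

  /** Returns the estimated topic-word matrix phi (numTopics x vocabSize). */
  def run(
      docs: Array[Array[Int]],   // each document is a sequence of word ids in [0, vocabSize)
      vocabSize: Int,
      numTopics: Int,
      alpha: Double = 0.1,       // symmetric document-topic prior (placeholder default)
      beta: Double = 0.01,       // symmetric topic-word prior (placeholder default)
      iterations: Int = 200,
      seed: Long = 42L): Array[Array[Double]] = {

    val rng = new Random(seed)
    val docTopic = Array.ofDim[Int](docs.length, numTopics)  // n(d, k)
    val topicWord = Array.ofDim[Int](numTopics, vocabSize)   // n(k, w)
    val topicTotal = new Array[Int](numTopics)                // n(k)

    // Assign every token a random topic and build the count tables.
    val z = docs.map(_.map(_ => rng.nextInt(numTopics)))
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i); val k = z(d)(i)
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }

    val p = new Array[Double](numTopics)
    for (_ <- 0 until iterations; d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i)
      val old = z(d)(i)
      // Remove the current assignment before resampling.
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1

      // Unnormalized full conditional p(z = k | everything else).
      var sum = 0.0
      var k = 0
      while (k < numTopics) {
        p(k) = (docTopic(d)(k) + alpha) *
          (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta)
        sum += p(k)
        k += 1
      }

      // Draw the new topic from the unnormalized distribution.
      var u = rng.nextDouble() * sum
      var newK = 0
      while (u > p(newK) && newK < numTopics - 1) { u -= p(newK); newK += 1 }

      z(d)(i) = newK
      docTopic(d)(newK) += 1; topicWord(newK)(w) += 1; topicTotal(newK) += 1
    }

    // Posterior mean of the topic-word distributions.
    Array.tabulate(numTopics, vocabSize) { (k, w) =>
      (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta)
    }
  }
}
{code}

Running this on a small common dataset and comparing the returned topic-word matrix (after aligning topics, e.g. by greedy matching) against the distributed output would be one way to carry out the correctness check in TODO 1.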