[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131299#comment-14131299 ]
Xiangrui Meng commented on SPARK-1405:
--------------------------------------

[~xusen] and [~gq] Thanks for working on LDA! The main feedback on your implementations concerns how the models are stored. [~josephkb] and I had an offline discussion with Evan and Joey (AMPLab) about LDA's interface and implementation.

For the input data, we recommend `RDD[(Int, Vector)]`, where each pair consists of a document id and its word distribution, which may come from a text vectorizer.

For the output model, the LDA model is huge (W*K + D*K values, where W is the number of words, D is the number of documents, and K is the number of topics), so we should store the model distributively for better scalability, e.g., as an `RDD[(Int, Vector)]`, or using Long ids.

Joey already has an LDA implementation using GraphX: https://github.com/jegonzal/graphx/blob/LDA/graph/src/main/scala/org/apache/spark/graph/algorithms/TopicModeling.scala With GraphX, we can treat documents and words as graph nodes and topic assignments as edges; the code is easy to understand.

There is also a paper describing a distributed implementation of LDA on Spark that uses a DSGD-like partitioning of the doc-word matrix: http://jmlr.org/proceedings/papers/v36/qiu14.pdf

Rough sketches of the recommended input format and of the GraphX representation are appended after the quoted description below. Anyone interested in helping test those implementations?

> parallel Latent Dirichlet Allocation (LDA) atop Spark in MLlib
> ---------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>              Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts
> topics from a text corpus. Unlike the current machine learning algorithms
> in MLlib, which rely on optimization methods such as gradient descent,
> LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with
> a wholeTextFiles API (already solved), word segmentation (imported from
> Lucene), and a Gibbs sampling core.
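A minimal sketch of the recommended input format, assuming the corpus has already been tokenized and a fixed vocabulary map is available; `toLdaInput`, its parameters, and the use of Long document ids are illustrative assumptions, not an agreed-upon API:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: turn tokenized documents into the (docId, term-count
// vector) pairs recommended above, using Long ids for scalability.
def toLdaInput(docs: RDD[Seq[String]],
               vocab: Map[String, Int]): RDD[(Long, Vector)] = {
  val vocabSize = vocab.size
  docs.zipWithIndex().map { case (tokens, docId) =>
    // Count occurrences of each in-vocabulary token.
    val counts = tokens
      .flatMap(vocab.get)                // drop out-of-vocabulary tokens
      .groupBy(identity)
      .map { case (term, occs) => (term, occs.size.toDouble) }
    // Vectors.sparse sorts the (index, value) pairs internally.
    (docId, Vectors.sparse(vocabSize, counts.toSeq))
  }
}
```

Built this way, both the corpus and a model stored as `RDD[(Long, Vector)]` stay partitioned across the cluster, so nothing of size W*K or D*K ever has to fit on the driver.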
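Similarly, a rough sketch of the GraphX representation described in the comment, with documents and words as vertices and one edge per token carrying its topic assignment. This is not Joey's actual code; `buildDocWordGraph`, the vertex-id offset scheme, and the random topic initialization are assumptions made for illustration:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical sketch of a bipartite doc-word graph for Gibbs sampling.
def buildDocWordGraph(tokenized: RDD[(Long, Seq[Int])], // (docId, word indices)
                      numDocs: Long,
                      numTopics: Int): Graph[Array[Long], Int] = {
  val edges = tokenized.flatMap { case (docId, words) =>
    // Offset word ids by numDocs so documents and words never collide in
    // GraphX's single Long vertex-id space.
    words.map(w => Edge(docId, numDocs + w, scala.util.Random.nextInt(numTopics)))
  }
  // Each vertex starts with a K-slot counter; a Gibbs sweep would aggregate
  // the topic assignments on incident edges into these counters.
  Graph.fromEdges(edges, Array.fill(numTopics)(0L))
}
```

Note that K-length counters over W + D vertices give exactly the W*K + D*K model size estimated above, and GraphX keeps both vertices and edges partitioned.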