[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149623#comment-14149623 ]
Xiangrui Meng commented on SPARK-1405:
--------------------------------------
[~pedrorodriguez] Thanks for the update and for sharing the timeline! By "standard
implementation" I mean LDA with Gibbs sampling and no special optimizations, but
still built on GraphX. We should keep the first PR simple so it can pass code
review before the feature freeze for 1.2 (end of Oct). I had an offline
discussion with [~gq]. He will run a performance comparison of the existing LDA
implementations. Once we have the numbers, let's pick one design and work on it
together.
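For concreteness, here is a minimal, hypothetical sketch of that layout:
documents and terms form the two sides of a bipartite GraphX graph, each token
occurrence is an edge carrying its current topic assignment, and the per-vertex
topic counts the sampler needs come out of one aggregation pass. All names below
are illustrative, not part of any agreed design.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    import scala.util.Random

    object GibbsLdaSketch {
      // Build a bipartite doc-term graph. Doc vertices keep their non-negative
      // ids; term ids are negated (offset by 1) so the two id spaces stay
      // disjoint. Each edge is one token, tagged with a random initial topic.
      def buildGraph(sc: SparkContext,
                     tokens: Seq[(Long, Long)], // (docId, termId), one per token
                     numTopics: Int): Graph[Array[Long], Int] = {
        val rng = new Random(42)
        val edges = sc.parallelize(tokens.map { case (doc, term) =>
          Edge(doc, -(term + 1), rng.nextInt(numTopics))
        })
        val g = Graph.fromEdges(edges, Array.empty[Long])
        // Per-vertex topic histograms (n_{d,k} on doc vertices, n_{k,w} on
        // term vertices), aggregated from the incident edges' assignments.
        val counts = g.mapReduceTriplets[Array[Long]](
          triplet => {
            val c = Array.fill(numTopics)(0L)
            c(triplet.attr) = 1L
            Iterator((triplet.srcId, c), (triplet.dstId, c))
          },
          (a, b) => a.zip(b).map { case (x, y) => x + y }
        )
        g.outerJoinVertices(counts)((_, _, opt) =>
          opt.getOrElse(Array.fill(numTopics)(0L)))
      }
    }

A Gibbs sweep would then alternate between resampling edge topics from these
counts and re-aggregating, which is exactly where the optimization and
partitioning questions come in.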
Could you add http://jmlr.org/proceedings/papers/v36/qiu14.pdf to your design
doc? The partitioning scheme they use is interesting, and we could explore it in
the future.
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Labels: features
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts
> topics from a text corpus. Unlike the current machine learning algorithms in
> MLlib, which rely on optimization methods such as gradient descent, LDA uses
> inference algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (already solved), a word segmentation step (imported from
> Lucene), and a Gibbs sampling core (a sketch of this step is below).
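> A hypothetical sketch of that per-token collapsed Gibbs update (count arrays
> and names are illustrative; in a distributed design they would live on the
> graph vertices):
>
>     import scala.util.Random
>
>     // Resample one token's topic. The caller must first decrement the
>     // token's current topic from all three count arrays, then increment the
>     // sampled topic back. alpha and beta are the Dirichlet hyperparameters.
>     def sampleTopic(docTopic: Array[Int],   // n_{d,k}: topic counts in this doc
>                     wordTopic: Array[Int],  // n_{k,w}: counts of this word by topic
>                     topicTotal: Array[Int], // n_k: total tokens per topic
>                     alpha: Double, beta: Double, vocabSize: Int,
>                     rng: Random): Int = {
>       val k = docTopic.length
>       val cdf = new Array[Double](k)
>       var sum = 0.0
>       var t = 0
>       while (t < k) {
>         // p(z = t | rest) is proportional to
>         // (n_{d,t} + alpha) * (n_{t,w} + beta) / (n_t + vocabSize * beta)
>         sum += (docTopic(t) + alpha) * (wordTopic(t) + beta) /
>                (topicTotal(t) + vocabSize * beta)
>         cdf(t) = sum
>         t += 1
>       }
>       val u = rng.nextDouble() * sum // inverse-CDF draw over the cumulative sums
>       var i = 0
>       while (i < k - 1 && cdf(i) < u) i += 1
>       i
>     }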