[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149623#comment-14149623 ]
Xiangrui Meng commented on SPARK-1405:
--------------------------------------
[~pedrorodriguez] Thanks for the update and for sharing the timeline! By "standard
implementation" I mean LDA with Gibbs sampling and no special optimizations, but
still built on GraphX. We should keep the first PR simple so it can pass code
review before the feature freeze for 1.2 (end of Oct). I had an offline
discussion with [~gq]. He will run a performance comparison of the existing LDA
implementations. Once we have the numbers, let's pick one design and work on it
together.
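For concreteness, here is a minimal, hypothetical sketch of that layout:
documents and terms form the two sides of a bipartite GraphX graph, each token
occurrence is an edge carrying its current topic assignment, and the per-vertex
topic counts the sampler needs come out of one aggregation pass. All names below
are illustrative, not part of any agreed design.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    import scala.util.Random

    object GibbsLdaSketch {
      // Build a bipartite doc-term graph. Doc vertices keep their non-negative
      // ids; term ids are negated (offset by 1) so the two id spaces stay
      // disjoint. Each edge is one token, tagged with a random initial topic.
      def buildGraph(sc: SparkContext,
                     tokens: Seq[(Long, Long)], // (docId, termId), one per token
                     numTopics: Int): Graph[Array[Long], Int] = {
        val rng = new Random(42)
        val edges = sc.parallelize(tokens.map { case (doc, term) =>
          Edge(doc, -(term + 1), rng.nextInt(numTopics))
        })
        val g = Graph.fromEdges(edges, Array.empty[Long])
        // Per-vertex topic histograms (n_{d,k} on doc vertices, n_{k,w} on
        // term vertices), aggregated from the incident edges' assignments.
        val counts = g.mapReduceTriplets[Array[Long]](
          triplet => {
            val c = Array.fill(numTopics)(0L)
            c(triplet.attr) = 1L
            Iterator((triplet.srcId, c), (triplet.dstId, c))
          },
          (a, b) => a.zip(b).map { case (x, y) => x + y }
        )
        g.outerJoinVertices(counts)((_, _, opt) =>
          opt.getOrElse(Array.fill(numTopics)(0L)))
      }
    }

A Gibbs sweep would then alternate between resampling edge topics from these
counts and re-aggregating, which is exactly where the optimization and
partitioning questions come in.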
Could you add http://jmlr.org/proceedings/papers/v36/qiu14.pdf to your design
doc? The partitioning scheme they use is interesting, and we could explore it in
the future.
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Labels: features
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts
> topics from a text corpus. Unlike the current machine learning algorithms in
> MLlib, which rely on optimization methods such as gradient descent, LDA uses
> inference algorithms such as Gibbs sampling.
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (already solved), a word segmentation step (imported from
> Lucene), and a Gibbs sampling core (a sketch of this step is below).
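> A hypothetical sketch of that per-token collapsed Gibbs update (count arrays
> and names are illustrative; in a distributed design they would live on the
> graph vertices):
>
>     import scala.util.Random
>
>     // Resample one token's topic. The caller must first decrement the
>     // token's current topic from all three count arrays, then increment the
>     // sampled topic back. alpha and beta are the Dirichlet hyperparameters.
>     def sampleTopic(docTopic: Array[Int],   // n_{d,k}: topic counts in this doc
>                     wordTopic: Array[Int],  // n_{k,w}: counts of this word by topic
>                     topicTotal: Array[Int], // n_k: total tokens per topic
>                     alpha: Double, beta: Double, vocabSize: Int,
>                     rng: Random): Int = {
>       val k = docTopic.length
>       val cdf = new Array[Double](k)
>       var sum = 0.0
>       var t = 0
>       while (t < k) {
>         // p(z = t | rest) is proportional to
>         // (n_{d,t} + alpha) * (n_{t,w} + beta) / (n_t + vocabSize * beta)
>         sum += (docTopic(t) + alpha) * (wordTopic(t) + beta) /
>                (topicTotal(t) + vocabSize * beta)
>         cdf(t) = sum
>         t += 1
>       }
>       val u = rng.nextDouble() * sum // inverse-CDF draw over the cumulative sums
>       var i = 0
>       while (i < k - 1 && cdf(i) < u) i += 1
>       i
>     }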