[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153412#comment-14153412
 ] 

David Hall commented on SPARK-1405:
-----------------------------------

Hi everyone,

Sorry for taking so long to reply. As part of some contract work with 
Alpine, I've been working on yet another LDA implementation. We're actually 
implementing Partially Labeled LDA[1], which is a strict generalization of LDA. 
The implementation is based on EM MAP inference rather than Gibbs sampling; EM 
has been shown to converge much more quickly (in both number of iterations and 
wall time) and to better optima than Gibbs LDA[2]. It also keeps a well-defined 
interpretation when run in parallel: collapsed Gibbs sampling run in parallel 
has no guarantees, whereas EM is still guaranteed to converge to a local optimum.
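
To make the EM MAP idea concrete, here is a minimal single-machine sketch for 
plain LDA. It is not the Alpine code mentioned in this comment, and it does not 
cover the partially labeled variant. The function names (em_map_lda, 
log_posterior), the dict-of-counts document format, and the defaults 
alpha = eta = 1.1 are all assumptions made for illustration. The per-document 
E-step is independent across documents, and the topic-word statistics are plain 
sums, which is roughly why the same update can be computed in parallel without 
changing the result.

```python
import math
import random


def _normalize(row):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def em_map_lda(docs, vocab_size, num_topics, alpha=1.1, eta=1.1,
               iters=50, seed=0):
    """EM for MAP estimation in plain LDA (illustrative sketch only).

    docs: list of {word_id: count} dicts (hypothetical input format).
    alpha, eta: symmetric Dirichlet hyperparameters; both must be > 1
    so that the MAP estimate stays strictly positive.
    Returns (theta, beta): per-document topic mixtures and per-topic
    word distributions.
    """
    rng = random.Random(seed)
    theta = [_normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]
    beta = [_normalize([rng.random() + 0.1 for _ in range(vocab_size)])
            for _ in range(num_topics)]
    for _ in range(iters):
        # Expected counts start from the prior pseudo-counts (eta - 1).
        topic_word = [[eta - 1.0] * vocab_size for _ in range(num_topics)]
        new_theta = []
        for d, doc in enumerate(docs):
            doc_topic = [alpha - 1.0] * num_topics
            for w, n in doc.items():
                # E-step: posterior over topics for word w in document d.
                post = _normalize([theta[d][k] * beta[k][w]
                                   for k in range(num_topics)])
                for k in range(num_topics):
                    doc_topic[k] += n * post[k]
                    topic_word[k][w] += n * post[k]
            # M-step for this document's topic mixture.
            new_theta.append(_normalize(doc_topic))
        theta = new_theta
        # M-step for topics: in a parallel run, topic_word would be the
        # sum of the partial counts computed on each partition.
        beta = [_normalize(row) for row in topic_word]
    return theta, beta


def log_posterior(docs, theta, beta, alpha=1.1, eta=1.1):
    """Log posterior (up to a constant) that each EM step cannot decrease."""
    num_topics = len(beta)
    total = 0.0
    for d, doc in enumerate(docs):
        for w, n in doc.items():
            total += n * math.log(sum(theta[d][k] * beta[k][w]
                                      for k in range(num_topics)))
        total += (alpha - 1.0) * sum(math.log(t) for t in theta[d])
    for row in beta:
        total += (eta - 1.0) * sum(math.log(b) for b in row)
    return total
```

Calling em_map_lda(docs, vocab_size, num_topics) on a small corpus and comparing 
log_posterior after a few and after many iterations shows the non-decreasing 
objective that the convergence guarantee refers to.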

I'll post the code as soon as I clear it with Alpine.

[1] http://nlp.stanford.edu/~dramage/papers/pldp-kdd11.pdf
[2] http://mimno.infosci.cornell.edu/info6150/readings/UAI_09.pdf

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

