[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302952#comment-14302952
 ] 

yuhao yang commented on SPARK-1405:
-----------------------------------

Hi everyone, I'm sharing an implementation of [Online 
LDA|https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf] at 
https://github.com/hhbyyh/OnlineLDA_Spark, and I hope it is helpful to anyone 
interested.

The work is based on the research of [Matt 
Hoffman|http://www.cs.princeton.edu/~mdhoffma/] and [David M. 
Blei|http://www.cs.princeton.edu/~blei/topicmodeling.html]. Thanks to its 
online nature, the algorithm 
1. scans the corpus (document set) only once. Thus it {quote}need not locally 
store or collect the documents and can be handily applied to streaming document 
collections.{quote}
2. breaks the massive corpus into mini-batches and processes one batch at a 
time, which reduces memory and time consumption.
3. approximates the posterior as well as traditional batch approaches do 
(it generates comparable or better results).
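The mini-batch update loop described above can be sketched as follows. This is only an illustrative, single-machine sketch in the style of the linked Hoffman/Blei/Bach paper, not the actual code in the repository; all names and sizes (K, V, tau0, kappa, the toy corpus) are made up, and exp(E[log beta]) is approximated by the normalized lambda to keep the sketch dependency-free (the paper uses the digamma function).

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 10, 50                         # topics, vocabulary size (illustrative)
alpha, eta = 0.1, 0.01                # symmetric Dirichlet priors
tau0, kappa = 1.0, 0.7                # step size rho_t = (tau0 + t) ** -kappa
lam = rng.gamma(100.0, 0.01, (K, V))  # variational topic-word parameters

def online_update(batch, lam, t, D, iters=30):
    """One mini-batch step: per-document E-step, then a weighted M-step."""
    beta = lam / lam.sum(axis=1, keepdims=True)   # approx. topic-word probs
    lam_hat = np.zeros_like(lam)
    for ids, cts in batch:                        # ids: word ids, cts: counts
        gamma = np.ones(K)                        # doc-topic variational param
        for _ in range(iters):
            phi = beta[:, ids] * gamma[:, None]   # token-topic responsibilities
            phi /= phi.sum(axis=0, keepdims=True)
            gamma = alpha + phi @ cts
        lam_hat[:, ids] += phi * cts
    rho = (tau0 + t) ** -kappa
    # Blend the batch estimate, scaled up to the full corpus of D documents
    return (1 - rho) * lam + rho * (eta + D / len(batch) * lam_hat)

# Toy corpus: each document is (distinct word ids, counts)
docs = [(rng.choice(V, 8, replace=False), rng.integers(1, 4, 8).astype(float))
        for _ in range(100)]
for t, i in enumerate(range(0, len(docs), 10)):   # mini-batches of 10 docs
    lam = online_update(docs[i:i+10], lam, t, D=len(docs))
```

Because each batch is discarded after its update, the corpus is scanned exactly once and only one mini-batch needs to be in memory at a time, which is what makes the streaming setting workable.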

In demo runs, the current implementation (with many details still to be 
improved)
1. processed 8 million short articles (Stack Overflow post titles, average 
length 9, K=10) in 15 minutes.
2. processed the entire English Wikipedia dump (5876K documents, ~900 words 
per document on average, 30 GB on disk, K=10) in 2 hours and 17 minutes 
on a 4-node cluster (20 GB of memory, which could be much less).

Trials and suggestions are most welcome!

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>              Labels: features
>             Fix For: 1.3.0
>
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which use optimization algorithms such as gradient descent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.
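The Gibbs sampling core mentioned in the quoted description can be sketched in a toy, single-machine form. This is not the PR's code; the counts tables, hyperparameters, and toy corpus below are all illustrative. Collapsed Gibbs sampling resamples each token's topic from its full conditional given every other assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 20                          # topics, vocabulary size (illustrative)
alpha, beta = 0.5, 0.1                # Dirichlet priors
docs = [rng.integers(0, V, 15).tolist() for _ in range(30)]  # word-id lists

# Count tables: doc-topic, topic-word, per-topic totals
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = []                                # current topic of every token
for d, doc in enumerate(docs):
    zd = rng.integers(0, K, len(doc))
    z.append(zd)
    for w, k in zip(doc, zd):
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(50):                   # Gibbs sweeps over the whole corpus
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]               # remove this token's current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional p(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k               # record the new assignment
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

Note the contrast with the online variational approach above: Gibbs sampling needs the whole corpus (and its count tables) resident across many sweeps, which is exactly what the online algorithm avoids.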



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
