[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1405:
---------------------------------
    Description: 
Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts topics 
from a text corpus. Unlike the current machine learning algorithms in MLlib, 
which rely on optimization methods such as gradient descent, LDA uses 
inference algorithms such as Gibbs sampling. 

In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
wholeTextFiles API (already resolved), word segmentation (imported from 
Lucene), and a Gibbs sampling core.

Algorithm survey from Pedro: 
https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
API design doc from Joseph: 
https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing
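
For reference, the collapsed Gibbs sampling core mentioned above can be 
sketched as follows. This is a minimal single-machine illustration, not the 
PR's distributed implementation; the function name `lda_gibbs` and all 
parameter defaults are illustrative assumptions:

```python
import random

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (single-machine sketch).

    docs: list of documents, each a list of word tokens.
    Returns (doc-topic counts, topic-word counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    # Count tables maintained by the sampler.
    ndk = [[0] * n_topics for _ in docs]       # doc d -> topic k counts
    nkw = [[0] * V for _ in range(n_topics)]   # topic k -> word w counts
    nk = [0] * n_topics                        # total tokens per topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], widx[w]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # Full conditional p(z = t | rest), up to a constant.
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                # Sample a new topic proportional to the weights.
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return ndk, nkw, vocab
```

A distributed version would partition documents across workers and 
periodically synchronize the topic-word counts; this sketch only shows the 
per-token resampling step that such an implementation parallelizes.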

  was:
Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts topics 
from a text corpus. Unlike the current machine learning algorithms in MLlib, 
which rely on optimization methods such as gradient descent, LDA uses 
inference algorithms such as Gibbs sampling. 

In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
wholeTextFiles API (already resolved), word segmentation (imported from 
Lucene), and a Gibbs sampling core.


> parallel Latent Dirichlet Allocation (LDA) atop Spark in MLlib
> ---------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>              Labels: features
>             Fix For: 1.3.0
>
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms 
> in MLlib, which rely on optimization methods such as gradient descent, 
> LDA uses inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), word segmentation (imported from 
> Lucene), and a Gibbs sampling core.
> Algorithm survey from Pedro: 
> https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
> API design doc from Joseph: 
> https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
