[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

Debasish Das (JIRA) Wed, 19 Nov 2014 17:27:08 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218845#comment-14218845
 ]


Debasish Das commented on SPARK-1405:
-------------------------------------

I would like to compare the LSA formulations (sparse coding and PLSA with least 
square loss) from https://issues.apache.org/jira/browse/SPARK-2426 with LDA

I added MAP metric for examples.MovieLensALS in 
https://issues.apache.org/jira/browse/SPARK-4231 but I am not sure how MAP can 
be used for topic modeling datasets....we need some perplexity measures...

Could you guys please point me to the dataset and the quality measures that are 
being benchmarked on the LDA PR so that I can also test the LSA formulations in 
parallel ?


> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
>                 Key: SPARK-1405
>                 URL: https://issues.apache.org/jira/browse/SPARK-1405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xusen Yin
>            Assignee: Guoqiang Li
>            Priority: Critical
>              Labels: features
>         Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from text corpus. Different with current machine learning algorithms 
> in MLlib, instead of using optimization algorithms such as gradient desent, 
> LDA uses expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare a LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

Reply via email to