[
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148555#comment-14148555
]
Pedro Rodriguez commented on SPARK-1405:
----------------------------------------
[~mengxr], definitely a good idea to be coordinated about it. I have been
working with Evan, so I have been giving him status updates and making todos
with him. I will post progress updates here as well.
I have been working on creating a design doc/reference which you can find here:
https://docs.google.com/document/d/13MfroPXEEGKgaQaZlHkg1wdJMtCN5d8aHJuVkiOrOK4/edit?usp=sharing
It is, in large part, a way for us/me to keep notes while working, but I
would like to take some of it and convert it into documentation. It primarily
contains:
1. Relevant links to papers/code/repositories
2. Thorough explanation/documentation of LDA and motivation behind the graph
implementation (Joey's version)
3. Testing steps (which data sets on what)
4. Current todos (perhaps we should post them here primarily and update the
doc for consistency).
1. Currently I am working on testing (unit tests and correctness testing),
refactoring, and extending Joey's implementation. The objective for this week
is to have the mini-test running (a set of ~10 documents which acts as a
sanity check). The goal for early next week is to be running on NIPS. I think
the majority of the time to get there will be spent putting the dataset into a
parseable format (removing equations, stop words, ...) and ensuring that the
result looks correct.
To that end, we plan on running the same datasets through GraphLab for
benchmarking machine/ML performance, and through a Python implementation for
ML performance/correctness.
Once we are there, the plan is to start looking at running on Wikipedia.
2. The code I am currently working on lives here:
https://github.com/EntilZha/spark
https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/TopicModeling.scala
which is within GraphX, with the other graph based algorithms.
3. Prior to knowing about Joey's graph implementation, I wrote my own for a
final project. I stopped working on it since the graph implementation should
be more performant. It is probably a good point of discussion whether a
"standard" and a graph implementation should live side by side. When you
reference a standard implementation, is there a particular implementation you
are referring to that I can look at?
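For context on what I mean by a "standard" (non-graph) implementation: the usual collapsed Gibbs sampler keeps doc-topic and topic-word count tables and resamples each token's topic from its conditional distribution. A rough single-machine sketch follows; all names are illustrative and not taken from either codebase:

```scala
import scala.util.Random

// Minimal collapsed Gibbs sampler for LDA on a toy corpus.
// docs(d) is the sequence of word ids in document d.
class GibbsLDA(docs: Array[Array[Int]], numTopics: Int, vocabSize: Int,
               alpha: Double = 0.1, beta: Double = 0.01, seed: Long = 42L) {
  private val rng = new Random(seed)
  // z(d)(i): topic currently assigned to the i-th token of document d
  private val z = docs.map(_.map(_ => rng.nextInt(numTopics)))
  // Count tables, maintained incrementally as assignments change.
  private val docTopic   = Array.ofDim[Int](docs.length, numTopics)
  private val topicWord  = Array.ofDim[Int](numTopics, vocabSize)
  private val topicTotal = new Array[Int](numTopics)
  for (d <- docs.indices; i <- docs(d).indices) {
    val k = z(d)(i)
    docTopic(d)(k) += 1; topicWord(k)(docs(d)(i)) += 1; topicTotal(k) += 1
  }

  // One full sweep: resample every token's topic assignment.
  def sweep(): Unit =
    for (d <- docs.indices; i <- docs(d).indices) {
      val w = docs(d)(i); val old = z(d)(i)
      // Remove the current assignment from the counts.
      docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
      // Unnormalized conditional p(z = k | all other assignments).
      val p = Array.tabulate(numTopics) { k =>
        (docTopic(d)(k) + alpha) *
          (topicWord(k)(w) + beta) / (topicTotal(k) + vocabSize * beta)
      }
      // Draw from the discrete distribution proportional to p.
      var u = rng.nextDouble() * p.sum; var k = 0
      while (k < numTopics - 1 && u > p(k)) { u -= p(k); k += 1 }
      z(d)(i) = k
      docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
    }

  def topicOf(d: Int): Int = docTopic(d).indexOf(docTopic(d).max)
  def totalAssigned: Int = topicTotal.sum
}
```

The graph version expresses the same counts as vertex/edge state on the bipartite document-word graph, which is what should make it scale better on Spark.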
TLDR timeline:
End of this week: mini-dataset for sanity check + refactoring code + unit
testing
Next week: Format NIPS for input + run NIPS data set on Spark, GraphLab, and
Python LDA. I will be away from Berkeley at a conference, but hope to still get
those done.
From there, we would like to get running on larger datasets for performance
testing.
> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Labels: features
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts
> topics from a text corpus. Unlike the current machine learning algorithms
> in MLlib, instead of using optimization algorithms such as gradient descent,
> LDA uses expectation algorithms such as Gibbs sampling.
> In this PR, I prepare a LDA implementation based on Gibbs sampling, with a
> wholeTextFiles API (solved yet), a word segmentation (import from Lucene),
> and a Gibbs sampling core.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]