GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/4047

    [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM

    **This PR introduces an API + simple implementation for Latent Dirichlet 
Allocation (LDA).**
    
    The [design doc for this 
PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
 has been updated since I initially posted it.  In particular, see the API and 
Planning for the Future sections.
    
    ## Goals
    
    * Settle on a public API which may eventually include:
      * more inference algorithms
      * more options / functionality
    * Have an initial easy-to-understand implementation which others may 
improve.
    * This is NOT intended to support every topic model out there.  However, if 
there are suggestions for making this extensible or pluggable in the future, 
that could be nice, as long as it does not complicate the API or implementation 
too much.
    * This may not be very scalable yet.  It will be important to check and improve accuracy; for correctness of the implementation, please check against the Asuncion et al. (2009) paper referenced in the design doc.
    
    ## Sketch of contents of this PR
    
    **Dependency: This makes MLlib depend on GraphX.**
    
    Files and classes:
    * LDA.scala (441 lines):
      * class LDA (main estimator class)
      * LDA.Document  (text + document ID)
    * LDAModel.scala (266 lines)
      * abstract class LDAModel
      * class LocalLDAModel
      * class DistributedLDAModel
    * LDAExample.scala (245 lines): script to run LDA + a simple (private) 
Tokenizer
    * LDASuite.scala (144 lines)
    
    Data/model representation and algorithm:
    * Data/model: Uses GraphX, with term vertices + document vertices
    * Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh.  "On 
Smoothing and Inference for Topic Models."  UAI, 
2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
    * For more details, please see the description in the “DEVELOPERS NOTE” 
in LDA.scala
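    
    To make the graph layout concrete, here is a rough sketch of how the bipartite document-term graph and the per-edge EM expectation could look.  This is NOT the PR's actual code: the vertex/edge payloads, helper names, and the exact form of the update are my assumptions (e.g., the PR's MAP/EM variant may subtract 1 from the counts, per Asuncion et al. (2009)).
    
    ```scala
    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD
    
    object LDAGraphSketch {
    
      // Per-vertex state: expected number of tokens assigned to each of the k topics.
      type TopicCounts = Array[Double]
    
      // Term vertices get negative IDs so they can never collide with (non-negative) document IDs.
      def termVertexId(termIndex: Int): VertexId = -(termIndex + 1L)
    
      /** Build the bipartite document-term graph from (docId, termIndex, tokenCount) triples. */
      def buildGraph(tokenCounts: RDD[(Long, Int, Double)], k: Int): Graph[TopicCounts, Double] = {
        val edges: RDD[Edge[Double]] = tokenCounts.map { case (docId, term, cnt) =>
          Edge(docId, termVertexId(term), cnt)
        }
        // Every vertex (document or term) starts with an all-zero topic-count vector.
        Graph.fromEdges(edges, Array.fill(k)(0.0))
      }
    
      /**
       * E-step for a single (document, term) edge: the expected topic distribution for that
       * term's tokens, roughly following the smoothed EM update of Asuncion et al. (2009).
       * nWk / nKd are the term's and document's per-topic counts, nK the global per-topic
       * totals, alpha / eta the Dirichlet hyperparameters, and vocabSize the number of terms.
       */
      def edgeTopicDistribution(nWk: TopicCounts, nKd: TopicCounts, nK: TopicCounts,
                                alpha: Double, eta: Double, vocabSize: Int): TopicCounts = {
        val unnormalized = Array.tabulate(nK.length) { z =>
          (nWk(z) + eta) * (nKd(z) + alpha) / (nK(z) + vocabSize * eta)
        }
        val total = unnormalized.sum
        unnormalized.map(_ / total)
      }
    }
    ```
    
    Each EM iteration would aggregate these per-edge expectations (weighted by token counts) back onto the term, document, and global topic counts; see the DEVELOPERS NOTE in LDA.scala for how the PR actually organizes this.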
    
    ## Design notes
    
    Please refer to the JIRA for more discussion + the [design doc for this 
PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)
    
    Here, I list the main changes AFTER the design doc was posted.
    
    Design decisions:
    * logLikelihood() computes the log likelihood of the data and the current 
point estimate of parameters.  This is different from the likelihood of the 
data given the hyperparameters, which would be harder to compute.  I’d 
describe the current approach as more frequentist, whereas the harder approach 
would be more Bayesian.
    * The current API takes Documents as token count vectors.  I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR.  I have sketched this out in the design doc (along with handier versions of getTopics that return Strings).  See the usage sketch after this list.
    * Hyperparameters should be set differently for different 
inference/learning algorithms.  See Asuncion et al. (2009) in the design doc 
for a good demonstration.  I encourage good behavior via defaults and warning 
messages.
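    
    For concreteness, here is a hypothetical usage sketch written against the classes listed above.  The exact entry point (an LDA.Document wrapper vs. (documentId, termCountVector) pairs), the setter names, and the topics accessor (getTopics vs. something like describeTopics) are assumptions and may differ from this PR.
    
    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    
    object LDAUsageSketch {
      def run(sc: SparkContext): Unit = {
        // Each document is a term-count vector over a fixed vocabulary, keyed by a document ID.
        val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
          (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),  // mostly terms 0-1
          (1L, Vectors.dense(0.0, 0.0, 3.0, 1.0))   // mostly terms 2-3
        ))
    
        val lda = new LDA()
          .setK(2)                     // number of topics
          .setDocConcentration(1.1)    // alpha: document-topic Dirichlet hyperparameter
          .setTopicConcentration(1.1)  // eta: topic-term Dirichlet hyperparameter
          .setMaxIterations(50)
    
        val model = lda.run(corpus).asInstanceOf[DistributedLDAModel]
    
        // Log likelihood of the data under the current point estimates of the parameters
        // (not the marginal likelihood given only the hyperparameters).
        println(s"logLikelihood = ${model.logLikelihood}")
    
        // Top-weighted terms for each topic.
        model.describeTopics(2).zipWithIndex.foreach {
          case ((termIndices, termWeights), topic) =>
            println(s"topic $topic: " + termIndices.zip(termWeights).mkString(", "))
        }
      }
    }
    ```
    
    (The hyperparameter values shown are arbitrary; as noted above, sensible values depend on the inference algorithm, and the defaults/warnings should steer users toward them.)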
    
    Items planned for future PRs:
    * perplexity
    * API taking Strings
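    
    For reference, the perplexity planned above would presumably be the standard per-token quantity computed from the log likelihood: perplexity(D) = exp(-logLikelihood(D) / N), where N is the total number of tokens in D.  (The eventual PR might instead report a held-out or variational variant.)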
    
    ## Questions for reviewers
    
    * Should LDA be called LatentDirichletAllocation (and LDAModel be 
LatentDirichletAllocationModel)?
      * Pro: We may someday want LinearDiscriminantAnalysis.
      * Con: Very long names
    
    * Should LDA reside in clustering?  Or do we want a sub-package?
      * mllib.topicmodel
      * mllib.clustering.topicmodel
    
    * Does the API seem reasonable and extensible?
    
    * Unit tests:
      * Should there be a test which checks clustering results?  E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters.  Does that sound useful, or too flaky?  (A rough sketch follows below.)
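    
    For what it's worth, here is a sketch of what such a test could look like: two disjoint vocabulary halves, documents drawn from only one half each, and an assertion that each learned topic's top terms stay within one half.  The test harness (FunSuite + MLlibTestSparkContext) and the setter/accessor names are assumptions on my part, not the contents of LDASuite.scala.
    
    ```scala
    import org.scalatest.FunSuite
    
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.util.MLlibTestSparkContext
    
    // Hypothetical sanity check: two disjoint vocabularies should yield two clean topics.
    class LDATwoTopicsSuite extends FunSuite with MLlibTestSparkContext {
    
      test("LDA separates two clearly distinct topics") {
        // Terms 0-2 appear only in "A" documents, terms 3-5 only in "B" documents.
        val docs = sc.parallelize((0 until 20).map { i =>
          val counts =
            if (i % 2 == 0) Vectors.dense(5.0, 4.0, 3.0, 0.0, 0.0, 0.0)
            else            Vectors.dense(0.0, 0.0, 0.0, 3.0, 4.0, 5.0)
          (i.toLong, counts)
        })
    
        val model = new LDA().setK(2).setMaxIterations(30).setSeed(0L)
          .run(docs).asInstanceOf[DistributedLDAModel]
    
        // Each learned topic's top terms should fall entirely within one vocabulary half.
        model.describeTopics(3).foreach { case (termIndices, _) =>
          val halves = termIndices.map(_ / 3).toSet
          assert(halves.size == 1,
            s"topic mixes both vocabularies: ${termIndices.mkString(",")}")
        }
      }
    }
    ```
    
    Fixing the random seed and keeping the topics extremely well separated should keep such a test from being flaky.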
    
    ## Other notes
    
    This has not been tested much for scaling.  I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics.  Running it for 500 iterations caused it to fail because of GC problems.  Future PRs will need to improve scalability.
    
    ## Thanks to…
    
    * @dlwh  for the initial implementation
      * @jegonzal for some code in the initial implementation
    * The many contributors towards topic model implementations in Spark which 
were referenced as a basis for this PR: @akopich @witgo @yinxusen @dlwh 
@EntilZha @jegonzal  @IlyaKozlov
    
    CC: @mengxr


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark davidhall-lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4047
    
----
commit 186eba2736679cdb4072d37fcad296647c2ec1e2
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2014-12-16T23:58:36Z

    Added 3 files from dlwh LDA implementation

commit 087d81d73b9c98e2e087005c896d184fe95b7431
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-12T20:34:32Z

    Prepped LDA main class for PR, but some cleanups remain

commit 724e2cff12671ed21ac7d719570732b5a7eca96a
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-13T19:32:12Z

    cleanups before PR

commit 10bf4d6b2f10b2bd7bda1ec9eb270ee60ad9a6b8
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-14T00:45:06Z

    separated LDA models into own file.  more cleanups before PR

commit c6e430867ca32ca6f409f953a2d47dd04a1e6e53
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-14T18:17:20Z

    Unit tests and fixes for LDA, now ready for PR

----

