GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/4047
[SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM

**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**

The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it. In particular, see the API and Planning for the Future sections.

## Goals

* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there. However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently. It will be important to check and improve accuracy. For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.

## Sketch of contents of this PR

**Dependency: This makes MLlib depend on GraphX.**

Files and classes:
* LDA.scala (441 lines):
  * class LDA (main estimator class)
  * LDA.Document (text + document ID)
* LDAModel.scala (266 lines)
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)

Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the "DEVELOPERS NOTE" in LDA.scala

## Design notes

Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)

Here, I list the main changes AFTER the design doc was posted.

Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters. This is different from the likelihood of the data given the hyperparameters, which would be harder to compute. I'd describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors. I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR. I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms. See Asuncion et al. (2009) in the design doc for a good demonstration. I encourage good behavior via defaults and warning messages.

Items planned for future PRs:
* perplexity
* API taking Strings

## Questions for reviewers

* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names
* Should LDA reside in clustering? Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel
* Does the API seem reasonable and extensible?
* Unit tests:
  * Should there be a test which checks the clustering results? E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters. Does that sound useful or too flaky?

## Other notes

This has not been tested much for scaling.
I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for 500 iterations made it fail because of GC problems. Future PRs will need to improve the scaling.

## Thanks to…

* @dlwh for the initial implementation
  * + @jegonzal for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: @akopich @witgo @yinxusen @dlwh @EntilZha @jegonzal @IlyaKozlov

CC: @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark davidhall-lda

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4047

----
commit 186eba2736679cdb4072d37fcad296647c2ec1e2
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2014-12-16T23:58:36Z

    Added 3 files from dlwh LDA implementation

commit 087d81d73b9c98e2e087005c896d184fe95b7431
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-12T20:34:32Z

    Prepped LDA main class for PR, but some cleanups remain

commit 724e2cff12671ed21ac7d719570732b5a7eca96a
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-13T19:32:12Z

    cleanups before PR

commit 10bf4d6b2f10b2bd7bda1ec9eb270ee60ad9a6b8
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-14T00:45:06Z

    separated LDA models into own file.  more cleanups before PR

commit c6e430867ca32ca6f409f953a2d47dd04a1e6e53
Author: Joseph K. Bradley <jos...@databricks.com>
Date:   2015-01-14T18:17:20Z

    Unit tests and fixes for LDA, now ready for PR

----
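
To make the logLikelihood() design note concrete, here is a minimal, dependency-free sketch of what "log likelihood of the data under the current point estimate of parameters" could look like when documents are token count vectors. It is written in plain Python rather than Scala, and it is not the code in this PR: the function name `log_likelihood` and the parameter names `theta` (per-document topic proportions) and `phi` (per-topic term distributions) are assumptions made for illustration. It also leaves out any prior terms that the PR's implementation may include.

```python
import math


def log_likelihood(docs, theta, phi):
    """Log likelihood of token-count documents under fixed point estimates.

    Hypothetical illustration only (not the PR's API):
      docs[d][w]  = count of term w in document d (token count vector)
      theta[d][k] = estimated P(topic k | document d)
      phi[k][w]   = estimated P(term w | topic k)
    """
    num_topics = len(phi)
    total = 0.0
    for d, counts in enumerate(docs):
        for w, n in enumerate(counts):
            if n == 0:
                continue  # absent terms contribute nothing
            # Probability of term w in document d, mixing over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
            total += n * math.log(p)
    return total


# Two tiny documents over a 2-term vocabulary, one topic each.
docs = [[2, 0], [0, 3]]
theta = [[1.0, 0.0], [0.0, 1.0]]

# Uninformative topics versus topics that match the documents.
flat_phi = [[0.5, 0.5], [0.5, 0.5]]
sharp_phi = [[0.9, 0.1], [0.1, 0.9]]

flat_ll = log_likelihood(docs, theta, flat_phi)
sharp_ll = log_likelihood(docs, theta, sharp_phi)
```

In this toy setup the topics that match the documents score a higher log likelihood than the uninformative ones, which is the behavior one would expect from a point-estimate likelihood. The likelihood of the data given only the hyperparameters would instead require integrating over `theta` and `phi`, which is the harder computation the design note refers to.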