Hi,

I don't have practical experience with the MLlib LDA implementation, but regarding the variations in the topic matrix: LDA makes use of stochastic processes, so some variation between training runs is expected. If you call setSeed(seed) with the same seed value during initialization, though, your results should be identical.
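To illustrate the setSeed point, here is a minimal sketch against the RDD-based `org.apache.spark.mllib.clustering.LDA` API. The toy corpus, the object name `LdaSeedSketch`, the helper `trainTopics`, the choice of k = 2 and the seed 42 are all assumptions made up for illustration; none of them come from the prototype linked in the original question. The sketch trains twice with the same seed in local mode and prints the largest element-wise difference between the two topic matrices, so you can check how closely the runs agree on your own setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.rdd.RDD

object LdaSeedSketch {
  // Hypothetical toy corpus: (document id, term-count vector) over a 5-term vocabulary.
  val docs: Seq[(Long, Vector)] = Seq(
    (0L, Vectors.dense(1.0, 2.0, 0.0, 0.0, 1.0)),
    (1L, Vectors.dense(0.0, 1.0, 3.0, 1.0, 0.0)),
    (2L, Vectors.dense(2.0, 0.0, 0.0, 1.0, 1.0)),
    (3L, Vectors.dense(0.0, 0.0, 2.0, 2.0, 0.0))
  )

  // Train LDA with an explicit seed and return the vocabSize x k topics matrix.
  def trainTopics(corpus: RDD[(Long, Vector)], seed: Long): Matrix =
    new LDA()
      .setK(2)
      .setMaxIterations(20)
      .setSeed(seed) // fix the random seed so initialization is repeatable
      .run(corpus)
      .topicsMatrix

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lda-seed-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val corpus = sc.parallelize(docs, 2).cache()
      val first = trainTopics(corpus, 42L)
      val second = trainTopics(corpus, 42L)
      // Largest element-wise gap between the two runs trained with the same seed.
      val maxDiff = first.toArray.zip(second.toArray)
        .map { case (x, y) => math.abs(x - y) }
        .max
      println(s"max abs difference with identical seed: $maxDiff")
    } finally {
      sc.stop()
    }
  }
}
```

Changing the seed in one of the two `trainTopics` calls should make the difference clearly non-zero, which would match the run-to-run variation described in the original question.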
May I ask what exactly you mean by prediction? Topic assignments (inference)?

Best,
Carsten

On 11.09.2015 at 15:29, Marko Asplund wrote:
> Hi,
>
> We're considering using Spark MLlib (v >= 1.5) LDA implementation for
> topic modelling. We plan to train the model using a data set of about 12
> M documents and vocabulary size of 200-300 k items. Documents are
> relatively short, typically containing less than 10 words, but the
> number can range up to tens of words. The model would be updated
> periodically using e.g. a batch process while predictions will be
> queried by a long-running application process in which we plan to embed
> MLlib.
>
> Is the MLlib LDA implementation considered to be well-suited to this
> kind of use case?
>
> I did some prototyping based on the code samples on "MLlib - Clustering"
> page and noticed that the topics matrix values seem to vary quite a bit
> across training runs even with the exact same input data set. During
> prediction I observed similar behaviour.
> Is this due to the probabilistic nature of the LDA algorithm?
>
> Any caveats to be aware of with the LDA implementation?
>
> For reference, my prototype code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala
>
>
> thanks,
> marko

--
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources (AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL) www.kdsl.tu-darmstadt.de