Hi,

We're considering the Spark MLlib (v >= 1.5) LDA implementation for topic modelling. We plan to train the model on a data set of about 12M documents with a vocabulary of 200-300k terms. Documents are relatively short, typically containing fewer than 10 words, but they can range up to tens of words. The model would be updated periodically, e.g. by a batch process, while predictions would be served by a long-running application process in which we plan to embed MLlib.
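For context, the training job I have in mind is roughly the following sketch, based on the RDD-based mllib API. The input path, K, iteration count, and seed are placeholder values, not our real configuration; I'm assuming documents are pre-vectorized into term-count lines.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Rough sketch of the periodic batch training job.
// "docs.txt" and all numeric parameters are placeholders.
object LDATrainSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "lda-sketch")

    // Each line: space-separated term counts over the fixed vocabulary.
    val data = sc.textFile("docs.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

    // LDA.run expects an RDD of (documentId, termCountVector) pairs.
    val corpus = parsedData.zipWithIndex.map(_.swap).cache()

    val ldaModel = new LDA()
      .setK(100)            // number of topics (placeholder)
      .setMaxIterations(50) // placeholder
      .setSeed(1L)          // fix the seed for repeatable runs on identical input
      .run(corpus)

    // topicsMatrix: vocabSize x k matrix of per-topic term weights.
    println(s"Vocabulary size: ${ldaModel.vocabSize}")
    sc.stop()
  }
}
```

The trained model would then be persisted and loaded by the long-running application process for serving predictions.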
Is the MLlib LDA implementation considered well-suited to this kind of use case?

I did some prototyping based on the code samples on the "MLlib - Clustering" page and noticed that the topics matrix values vary quite a bit across training runs, even with the exact same input data set. I observed similar behaviour during prediction. Is this due to the probabilistic nature of the LDA algorithm? Are there any other caveats to be aware of with the LDA implementation?

For reference, my prototype code can be found here:
https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala

thanks,
marko