Hi,
I don't have practical experience with the MLlib LDA implementation, but
regarding the variations in the topic matrix: LDA makes use of stochastic
processes, so separate training runs on the same data can produce different
results. If you call setSeed(seed) with the same seed value when setting up
the model, your results should be identical though.
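
Here is a minimal sketch of what I mean. I have not run it against your data.
It assumes Spark 1.5 with the RDD-based org.apache.spark.mllib.clustering.LDA
API and a local master. The toy corpus, the object name LdaSeedSketch and the
helpers trainTopics and maxAbsDiff are made up for illustration. It trains
twice with the same seed and once with a different one, then prints the
largest element-wise difference between the topics matrices. With equal seeds
I would expect a difference at or very close to zero, as long as the input
partitioning is the same in both runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LdaSeedSketch {
  // Toy corpus (made up): (document id, term-count vector) over 4 terms.
  val docs: Seq[(Long, Vector)] = Seq(
    (0L, Vectors.dense(2.0, 1.0, 0.0, 0.0)),
    (1L, Vectors.dense(1.0, 3.0, 0.0, 0.0)),
    (2L, Vectors.dense(0.0, 0.0, 2.0, 1.0)),
    (3L, Vectors.dense(0.0, 0.0, 1.0, 2.0))
  )

  // Train with an explicit seed; return the topics matrix as a flat array.
  def trainTopics(sc: SparkContext, seed: Long): Array[Double] = {
    // Fixed partition count so both runs see the same data layout.
    val corpus = sc.parallelize(docs, 2).cache()
    new LDA()
      .setK(2)
      .setMaxIterations(10)
      .setSeed(seed)
      .run(corpus)
      .topicsMatrix
      .toArray
  }

  // Largest absolute element-wise difference between two equally sized arrays.
  def maxAbsDiff(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.max

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lda-seed-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sameSeed = maxAbsDiff(trainTopics(sc, 42L), trainTopics(sc, 42L))
      val otherSeed = maxAbsDiff(trainTopics(sc, 42L), trainTopics(sc, 7L))
      println(s"max |diff|, same seed: $sameSeed")
      println(s"max |diff|, different seeds: $otherSeed")
    } finally {
      sc.stop()
    }
  }
}
```

Note that even with a fixed seed the order of the topics is arbitrary, so
comparing models trained with different seeds column by column is only a
rough indication of how much they differ.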

May I ask what exactly you mean by prediction? Inferring topic assignments
for new documents?

Best,
Carsten


On 11.09.2015 at 15:29, Marko Asplund wrote:
> Hi,
> 
> We're considering using Spark MLlib (v >= 1.5) LDA implementation for
> topic modelling. We plan to train the model using a data set of about 12
> M documents and vocabulary size of 200-300 k items. Documents are
> relatively short, typically containing less than 10 words, but the
> number can range up to tens of words. The model would be updated
> periodically using e.g. a batch process while predictions will be
> queried by a long-running application process in which we plan to embed
> MLlib.
> 
> Is the MLlib LDA implementation considered to be well-suited to this
> kind of use case?
> 
> I did some prototyping based on the code samples on "MLlib - Clustering"
> page and noticed that the topics matrix values seem to vary quite a bit
> across training runs even with the exact same input data set. During
> prediction I observed similar behaviour.
> Is this due to the probabilistic nature of the LDA algorithm?
> 
> Any caveats to be aware of with the LDA implementation?
> 
> For reference, my prototype code can be found here:
> https://github.com/marko-asplund/tech-protos/blob/master/mllib-lda/src/main/scala/fi/markoa/proto/mllib/LDADemo.scala
> 
> 
> thanks,
> marko

-- 
Carsten Schnober
Doctoral Researcher
Ubiquitous Knowledge Processing (UKP) Lab
FB 20 / Computer Science Department
Technische Universität Darmstadt
Hochschulstr. 10, D-64289 Darmstadt, Germany
phone [+49] (0)6151 16-6227, fax -5455, room S2/02/B111
schno...@ukp.informatik.tu-darmstadt.de
www.ukp.tu-darmstadt.de

Web Research at TU Darmstadt (WeRC): www.werc.tu-darmstadt.de
GRK 1994: Adaptive Preparation of Information from Heterogeneous Sources
(AIPHES): www.aiphes.tu-darmstadt.de
PhD program: Knowledge Discovery in Scientific Literature (KDSL)
www.kdsl.tu-darmstadt.de

