I helped some with the MLlib LDA and worked quite a bit on a Gibbs-sampling version. I don't know whether the Gibbs version would help here, but since it is not (yet) in MLlib, Intel Analytics kindly published a Spark package with their adapted version plus a couple of other LDA algorithms:

http://spark-packages.org/package/intel-analytics/TopicModeling
https://github.com/intel-analytics/TopicModeling
It might be worth trying out.

Do you know what LDA algorithm VW uses?

Pedro

On Tue, Sep 22, 2015 at 1:54 AM, Marko Asplund <marko.aspl...@gmail.com> wrote:
> Hi,
>
> I did some profiling for my LDA prototype code that requests topic
> distributions from a model.
> According to Java Mission Control, more than 80% of execution time during
> the sample interval is spent in the following methods:
>
> org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
> org.apache.commons.math3.special.Gamma.digamma(double); count: 164; 22.91%
> org.apache.commons.math3.util.FastMath.log(double, double[]); count: 50; 6.98%
> java.lang.Double.valueOf(double); count: 31; 4.33%
>
> Is there any way of using the API more optimally?
> Are there any opportunities for optimising the "topicDistributions" code
> path in MLlib?
>
> My code looks like this:
>
> // executed once
> val model = LocalLDAModel.load(ctx, ModelFileName)
>
> // executed four times
> val samples = Transformers.toSparseVectors(vocabularySize,
>   ctx.parallelize(Seq(input))) // fast
> model.topicDistributions(samples.zipWithIndex.map(_.swap)) // <== this
>   seems to take about 4 seconds to execute
>
> marko

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodrig...@gmail.com | pedrorodriguez.io | 208-340-1703
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
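P.S. On the ~4-second topicDistributions call: since you call it once per input, part of that time may be fixed per-job Spark overhead rather than inference itself, so batching all of your inputs into a single topicDistributions call might amortize it. A minimal local-mode sketch of the batching idea (the toy corpus, query vectors, and vocabulary are made up for illustration; in your code the documents would come from Transformers.toSparseVectors and the model from LocalLDAModel.load):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LocalLDAModel}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("lda-batch"))

// Toy training corpus: (docId, word-count vector) over a 5-term vocabulary.
val corpus = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 0.0, 0.0, 3.0),
  Vectors.dense(0.0, 0.0, 4.0, 1.0, 0.0),
  Vectors.dense(2.0, 0.0, 1.0, 0.0, 1.0)
)).zipWithIndex.map(_.swap)

// Train a tiny model just for the sketch; you would load yours instead.
// The default (EM) optimizer returns a DistributedLDAModel, hence toLocal.
val model: LocalLDAModel =
  new LDA().setK(2).setMaxIterations(5).run(corpus) match {
    case d: DistributedLDAModel => d.toLocal
    case l: LocalLDAModel       => l
  }

// Score ALL query documents in ONE topicDistributions call, so the
// fixed per-job cost is paid once instead of once per document.
val queries = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0, 0.0, 0.0, 1.0),
  Vectors.dense(0.0, 0.0, 2.0, 1.0, 0.0)
)).zipWithIndex.map(_.swap)

val dists = model.topicDistributions(queries).collect()
dists.foreach { case (id, topics) => println(s"doc $id -> $topics") }
sc.stop()
```

Note also that the digamma/FastMath.log time in your profile is the variational inference loop itself, so batching only removes the per-call overhead, not the per-document inference cost.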