Re: How to speed up MLlib LDA?

2015-09-22 Thread Charles Earl
It seems that the Vowpal Wabbit version is most similar to what is in https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala. Although the Intel version seems to implement the Hierarchical Dirichlet Process (topics and subtopics) as

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
How optimized are the Commons math3 methods that showed up in profiling? Are there any higher-performance alternatives to these? marko

Re: How to speed up MLlib LDA?

2015-09-22 Thread Pedro Rodriguez
I helped some with the LDA and worked quite a bit on a Gibbs version. I don't know if the Gibbs version might help, but since it is not (yet) in MLlib, Intel Analytics kindly created a Spark package with their adapted version plus a couple of other LDA algorithms:

Re: How to speed up MLlib LDA?

2015-09-22 Thread Marko Asplund
Hi, I did some profiling for my LDA prototype code that requests topic distributions from a model. According to Java Mission Control, more than 80% of execution time during the sample interval is spent in the following methods: org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
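For context on why a logarithm routine can show up so prominently when requesting topic distributions: the textbook variational update for a single document evaluates digamma (which itself calls log) and exp over the whole topic vector on every iteration. The following is a minimal pure-Python sketch of that generic update. It is not MLlib's code and is not claimed to match its hot path; the names `digamma`, `exp_dirichlet_expectation`, and `infer_topics`, the toy topic-word matrix, and the defaults for `alpha`, `max_iter`, and `tol` are all assumptions made for illustration.

```python
import math


def digamma(x):
    """Approximate digamma(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    # Shift x upward until the asymptotic series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def exp_dirichlet_expectation(gamma):
    """exp(E[log theta]) for theta ~ Dirichlet(gamma)."""
    total = digamma(sum(gamma))
    return [math.exp(digamma(g) - total) for g in gamma]


def infer_topics(word_counts, topic_word, alpha=0.1, max_iter=100, tol=1e-3):
    """Estimate a topic distribution for one document.

    word_counts: {word_id: count} for the document (hypothetical format).
    topic_word:  topic_word[t][w], a toy stand-in for exp(E[log beta]).
    """
    k = len(topic_word)
    gamma = [1.0] * k
    for _ in range(max_iter):
        # One digamma + exp per topic, every iteration.
        exp_elog_theta = exp_dirichlet_expectation(gamma)
        new_gamma = [alpha] * k
        for w, n in word_counts.items():
            weights = [exp_elog_theta[t] * topic_word[t][w] for t in range(k)]
            norm = sum(weights) + 1e-100
            for t in range(k):
                new_gamma[t] += n * weights[t] / norm
        change = sum(abs(a - b) for a, b in zip(gamma, new_gamma)) / k
        gamma = new_gamma
        if change < tol:
            break
    total = sum(gamma)
    return [g / total for g in gamma]


if __name__ == "__main__":
    topics = [[0.9, 0.1], [0.1, 0.9]]   # two toy topics over two words
    print(infer_topics({0: 5}, topics))  # document with five copies of word 0
```

In this sketch the number of log/exp evaluations grows with the number of topics times the number of iterations, so iteration count and convergence tolerance are natural knobs to inspect when a profile looks like the one above.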

Re: How to speed up MLlib LDA?

2015-09-17 Thread Marko Asplund
Hi Feynman, I just tried that, but there wasn't a noticeable change in training performance. On the other hand, model loading time was reduced to ~5 seconds from ~2 minutes (now persisted as LocalLDAModel). However, query / prediction time was unchanged. Unfortunately, this is the critical

How to speed up MLlib LDA?

2015-09-15 Thread Marko Asplund
that there are differences in the LDA implementations, but which parameters should I tweak to make the LDA implementations work with similar operational parameters and thus make the results more comparable? Any suggestions on how to speed up MLlib LDA and thoughts on speed-accuracy tradeoffs? The log includes

Re: How to speed up MLlib LDA?

2015-09-15 Thread Feynman Liang
Hi Marko, I haven't looked into your case in much detail but one immediate thought is: have you tried the OnlineLDAOptimizer? Its implementation and resulting LDA model (LocalLDAModel) is quite different (doesn't depend on GraphX, assumes the model fits on a single machine) so you may see

Re: How to speed up MLlib LDA?

2015-09-15 Thread Marko Asplund
While doing some more testing I noticed that loading the persisted model from disk (~2 minutes) as well as querying LDA model topic distributions (~4 seconds for one document) are quite slow operations. Our application is querying LDA model topic distribution (for one doc at a time) as part of
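Because loading is reported as far more expensive than a single query, one common mitigation is to load the model once per process and reuse it for every query. The excerpt does not say whether the application already does this, so treat the following as a generic sketch only: `load_model_from_disk` is a hypothetical stand-in for the real (slow) loader, and `get_model` is an assumed helper name.

```python
import functools
import time


def load_model_from_disk(path):
    """Hypothetical stand-in for an expensive model load."""
    time.sleep(0.01)  # simulate slow I/O and deserialization
    return {"path": path, "loaded_at": time.time()}


@functools.lru_cache(maxsize=1)
def get_model(path):
    """Load the model on first use, then return the cached instance."""
    return load_model_from_disk(path)


if __name__ == "__main__":
    first = get_model("lda-model")
    second = get_model("lda-model")   # served from the cache, no reload
    print(first is second)
    print(get_model.cache_info())
```

This only removes repeated load cost; it does nothing for the per-document query time, which has to be addressed separately.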