It seems that the Vowpal Wabbit version is most similar to what is in
https://github.com/intel-analytics/TopicModeling/blob/master/src/main/scala/org/apache/spark/mllib/topicModeling/OnlineHDP.scala
although the Intel version seems to implement the Hierarchical Dirichlet Process
(topics and subtopics) as well.
How optimized are the Commons Math3 methods that showed up in profiling?
Are there any higher-performance alternatives to them?
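One quick way to answer that is a micro-benchmark. The sketch below (hypothetical, not from this thread) times any `Double => Double` log implementation; `java.lang.Math.log` is intrinsified on modern JVMs and is a natural baseline, and `FastMath.log` could be plugged in where commons-math3 is on the classpath:

```scala
// Hypothetical micro-benchmark sketch: times a log implementation over many calls.
// java.lang.Math.log serves as the stdlib baseline; FastMath.log from
// commons-math3 could be passed in the same way where that dependency exists.
object LogBench {
  def time(label: String, log: Double => Double, n: Int = 5000000): Double = {
    var acc = 0.0
    val start = System.nanoTime()
    var i = 1
    while (i <= n) { acc += log(i.toDouble); i += 1 }
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label: $elapsedMs%.1f ms (checksum $acc%.3f)")
    elapsedMs
  }

  def main(args: Array[String]): Unit = {
    time("java.lang.Math.log", math.log) // first run doubles as JIT warm-up
    time("java.lang.Math.log", math.log)
    // time("FastMath.log", org.apache.commons.math3.util.FastMath.log) // with commons-math3
  }
}
```

Numbers from a loop like this are only indicative (no JMH-style isolation), but they are usually enough to tell whether the log call itself is the bottleneck.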
marko
I helped some with the LDA and worked quite a bit on a Gibbs version. I
don't know if the Gibbs version might help, but since it is not (yet) in
MLlib, Intel Analytics kindly created a spark package with their adapted
version plus a couple other LDA algorithms:
Hi,
I did some profiling for my LDA prototype code that requests topic
distributions from a model.
According to Java Mission Control, more than 80% of execution time during the
sampling interval is spent in the following methods:
org.apache.commons.math3.util.FastMath.log(double); count: 337; 47.07%
Hi Feynman,
I just tried that, but there wasn't a noticeable change in training
performance. On the other hand, model loading time was reduced from ~2 minutes
to ~5 seconds (the model is now persisted as a LocalLDAModel).
However, query/prediction time was unchanged.
Unfortunately, this is the critical operation for our application. I understand
that there are differences in the LDA implementations, but which
parameters should I tweak to make the implementations run with similar
operational parameters and thus make the results more comparable?
Any suggestions on how to speed up MLlib LDA and thoughts on speed-accuracy
tradeoffs?
The log includes
Hi Marko,
I haven't looked into your case in much detail, but one immediate thought
is: have you tried the OnlineLDAOptimizer? Its implementation and the
resulting LDA model (LocalLDAModel) are quite different (it doesn't depend on
GraphX and assumes the model fits on a single machine), so you may see
different performance characteristics.
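A minimal sketch of switching to the online optimizer (the parameter values and the `corpus` name are illustrative assumptions, not recommendations):

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: train with OnlineLDAOptimizer instead of the default EM optimizer.
// `corpus` is assumed to be an RDD[(Long, Vector)] of (docId, termCounts) pairs.
def trainOnline(corpus: RDD[(Long, Vector)], k: Int): LocalLDAModel = {
  val lda = new LDA()
    .setK(k)
    .setMaxIterations(100) // illustrative value
    .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
  // The online optimizer yields a LocalLDAModel directly,
  // with no GraphX-backed DistributedLDAModel in between.
  lda.run(corpus).asInstanceOf[LocalLDAModel]
}
```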
While doing some more testing I noticed that loading the persisted model
from disk (~2 minutes) as well as querying LDA model topic distributions
(~4 seconds for one document) are quite slow operations.
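For reference, a hedged sketch of the load-and-query path being measured (the method name and `modelPath` are assumptions; the MLlib calls are the standard `LocalLDAModel.load` and `topicDistributions`):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch of the slow path described above: load a persisted LocalLDAModel,
// then request topic distributions for a single-document RDD.
def queryTopics(sc: SparkContext, modelPath: String,
                doc: (Long, Vector)): Array[(Long, Vector)] = {
  val model = LocalLDAModel.load(sc, modelPath)  // the ~2-minute load step
  val docs: RDD[(Long, Vector)] = sc.parallelize(Seq(doc))
  model.topicDistributions(docs).collect()       // the ~4-second query step
}
```

If the model is queried one document at a time, keeping the loaded LocalLDAModel resident (rather than reloading it per request) removes the load cost from the request path entirely.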
Our application queries LDA model topic distributions (for one document at a
time) as part of