In my usage of MLlib's LDA, I have noticed that repeated invocations of
LDAModel.transform() duplicate a matrix derived from the model's topic
matrix. Because this derived matrix can be quite large (imagine hundreds of
topics and a vocabulary in the tens or hundreds of thousands of words), the
duplication places a significant memory burden on the driver. In a
streaming job, I was able to crash the driver with an OutOfMemoryError by
invoking LDAModel.transform() frequently.
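
For illustration, here is a rough sketch of the usage pattern that
triggered this (the model path, input path, and batch loop are hypothetical
stand-ins for my actual streaming job; the repeated transform() calls are
the relevant part):

    import org.apache.spark.ml.clustering.LocalLDAModel
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object RepeatedTransformSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("lda-transform-repro").getOrCreate()

        // Hypothetical fitted model and input source; only the loop matters here.
        val model = LocalLDAModel.load("/path/to/fitted/lda/model")

        def nextBatch(): DataFrame =
          spark.read.parquet("/path/to/incoming/batch")  // stand-in for a streaming micro-batch

        // Every iteration calls transform(), which rebuilds the derived
        // expElogbeta matrix on the driver; memory use grows until the
        // driver eventually dies with an OutOfMemoryError.
        while (true) {
          val scored = model.transform(nextBatch())
          scored.write.mode("append").parquet("/path/to/output")
        }
      }
    }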

The root of the issue, as I see it, is in
org.apache.spark.mllib.clustering.LDAModel#getTopicDistributionMethod() (
https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala#L394),
which is called every time by
org.apache.spark.ml.clustering.LDAModel#transform() (
https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L458).
Within getTopicDistributionMethod(), a fresh expElogbeta matrix is computed
rather than drawn from a cache. When LDAModel.transform() is called
multiple times (e.g., in a streaming job), this expElogbeta matrix is
recreated on each call, potentially taking up a significant amount of
driver memory. This happens even if the transform() computations are never
actually executed, because the UDF and the derived matrix it captures are
part of the execution plan (if I'm using that term correctly).
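
To make that concrete, here is a rough, self-contained paraphrase of the
pattern (not the actual Spark source; the matrix derivation and scoring are
simplified stand-ins), showing how each call builds a fresh derived matrix
and captures it in the returned closure:

    import breeze.linalg.{DenseMatrix => BDM}
    import breeze.numerics.exp

    object ClosureCaptureSketch {
      // Stand-in for the real derivation; the math is simplified, but the
      // memory behaviour is the same: a fresh dense k x vocabSize matrix is
      // allocated on every call.
      private def deriveExpElogbeta(topicsMatrix: BDM[Double]): BDM[Double] =
        exp(topicsMatrix)

      // Paraphrase of getTopicDistributionMethod(): each invocation derives
      // a new matrix and returns a closure that captures it, so each
      // returned function pins its own copy on the driver.
      def topicDistributionMethod(topicsMatrix: BDM[Double]): BDM[Double] => Double = {
        val expElogbeta = deriveExpElogbeta(topicsMatrix)  // new copy every call
        docVector => (docVector * expElogbeta).data.sum    // placeholder scoring
      }

      def main(args: Array[String]): Unit = {
        val topics = BDM.rand(100, 10000)                  // e.g. 100 topics, 10k-word vocabulary
        // Keeping each returned function alive (as a UDF in a plan would)
        // keeps every derived matrix alive with it.
        val fns = (1 to 20).map(_ => topicDistributionMethod(topics))
        println(s"${fns.size} closures, each holding its own ~${100L * 10000 * 8 / (1 << 20)} MB matrix")
      }
    }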

Is there a reason the expElogbeta matrix has to be recomputed on every
call, rather than computed once and then cached? Since transform() is the
main entry point for applying any fitted LDA model and is likely to be
called frequently, I think the added efficiency would be worthwhile.
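
To make the suggestion concrete, here is a hedged sketch of the kind of
change I have in mind (a simplified model-like class, not a patch against
the actual Spark classes): derive the matrix lazily once and share it
across calls.

    import breeze.linalg.{DenseMatrix => BDM}
    import breeze.numerics.exp

    // Illustrative only: derive expElogbeta once, on first use, and reuse
    // it, instead of rebuilding it on every transform()-style call.
    class CachedTopicModel(topicsMatrix: BDM[Double]) {

      private lazy val expElogbeta: BDM[Double] = exp(topicsMatrix)  // computed once, then shared

      def topicDistributionMethod: BDM[Double] => Double =
        docVector => (docVector * expElogbeta).data.sum              // every closure shares one matrix
    }

With that shape, a streaming job that requests the method on every
micro-batch still holds only one copy of the matrix on the driver.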

Happy to share additional information as needed!

Regards,
Andrew
