In my usage of MLlib's LDA, I have noticed that repeated invocations of LDAModel.transform() result in the duplication of a matrix derived from the model's topic matrix. Because this derived matrix can be quite large (imagine hundreds of topics and a vocabulary in the tens or hundreds of thousands of words), the duplication can place a significant memory burden on the driver. In a streaming job, I was able to crash the driver with an OutOfMemoryError simply by invoking LDAModel.transform() frequently.
The root of the issue, as I see it, is in org.apache.spark.mllib.clustering.LDAModel#getTopicDistributionMethod() (https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala#L394), which is called on every invocation of org.apache.spark.ml.clustering.LDAModel#transform() (https://github.com/apache/spark/blob/v2.4.0/mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala#L458). Within getTopicDistributionMethod(), a fresh expElogbeta matrix is computed rather than drawn from a cache. When LDAModel.transform() is called repeatedly (e.g., in a streaming job), that matrix is duplicated each time, potentially consuming a significant amount of driver memory. This happens even if the transform() computations are never actually executed, because the UDF and the derived matrix become part of the execution plan (if I'm using that term correctly).

Is there any reason the expElogbeta matrix has to be recomputed on every call, rather than being computed the first time and then cached? Given that transform() is the entry point for using any fitted LDA model, and is likely to be called frequently, any added efficiency here would be broadly helpful.

Happy to share additional information as needed!

Regards,
Andrew
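To make the caching idea concrete, here is a minimal sketch of the pattern I have in mind. Everything here is hypothetical stand-in code, not the actual Spark classes: `ToyLDAModel` stands in for LocalLDAModel, plain `Array[Array[Double]]` stands in for Breeze matrices, and `deriveExpElogbeta` stands in for the real exp(dirichletExpectation(topicsMatrix)) computation. The point is only that a `lazy val` computes the derived matrix once and every returned function closes over that single copy:

```scala
// Hypothetical sketch: memoize the derived matrix with a lazy val so repeated
// "transform" calls reuse one copy instead of recomputing it each time.
class ToyLDAModel(val topicsMatrix: Array[Array[Double]]) {
  var computations = 0 // instrumentation: counts how often the matrix is derived

  // Expensive derived matrix: computed at most once, on first use.
  lazy val expElogbeta: Array[Array[Double]] = deriveExpElogbeta(topicsMatrix)

  private def deriveExpElogbeta(m: Array[Array[Double]]): Array[Array[Double]] = {
    computations += 1
    m.map(_.map(math.exp)) // placeholder for exp(dirichletExpectation(m))
  }

  // Analogue of getTopicDistributionMethod(): the returned closure captures
  // the cached matrix rather than a freshly computed copy per call.
  def topicDistributionMethod: Array[Double] => Double = { doc =>
    expElogbeta.map(row => row.zip(doc).map { case (a, b) => a * b }.sum).sum
  }
}

object Demo extends App {
  val model = new ToyLDAModel(Array(Array(0.0, 1.0), Array(1.0, 0.0)))
  val f1 = model.topicDistributionMethod
  val f2 = model.topicDistributionMethod // second "transform" call
  f1(Array(1.0, 1.0))
  f2(Array(1.0, 1.0))
  assert(model.computations == 1, "derived matrix should be computed exactly once")
}
```

In the real code the caching would presumably live on LocalLDAModel (or be threaded through the ml wrapper), and thread-safety of the lazy initialization would need a look, but the shape of the fix would be the same: derive once, close over the cached value.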