LDAOptimizer.scala:421 collects a numTopics-by-vocabSize matrix of summary statistics to the driver. I suspect that this is what is causing the failure.
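To get a feel for the numbers, here is a rough back-of-the-envelope sketch in plain Scala (no Spark needed). It assumes each task sends back one dense numTopics-by-vocabSize matrix of 8-byte doubles, which I have not verified against the 1.4.0 source, and the topic, task and vocabulary counts are made up for illustration. The object and method names are hypothetical as well.

```scala
// Rough estimate of the data returned to the driver by the reduce step.
// Assumption (not checked against the source): every task returns one dense
// numTopics x vocabSize matrix of 8-byte doubles.
object LdaStatsSize {
  // Bytes in a single dense numTopics x vocabSize matrix of doubles.
  def matrixBytes(numTopics: Int, vocabSize: Long): Long =
    numTopics.toLong * vocabSize * 8L

  // Bytes across all tasks, if each task returns one such matrix.
  def totalResultBytes(numTopics: Int, vocabSize: Long, numTasks: Int): Long =
    matrixBytes(numTopics, vocabSize) * numTasks

  def toGiB(bytes: Long): Double = bytes.toDouble / (1L << 30)

  def main(args: Array[String]): Unit = {
    val numTopics = 100 // hypothetical value
    val numTasks = 200  // hypothetical value
    // Compare a large raw vocabulary with a smaller, fixed feature space.
    for (vocabSize <- Seq(5000000L, 1L << 18)) {
      val total = totalResultBytes(numTopics, vocabSize, numTasks)
      println(f"vocabSize=$vocabSize%d -> ~${toGiB(total)}%.1f GiB across $numTasks%d tasks")
    }
  }
}
```

If the assumption holds, the total grows linearly in vocabSize, numTopics and the number of tasks, so raising spark.driver.maxResultSize would only postpone the error.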
One thing you could try is decreasing the vocabulary size. One option is to use a HashingTF, if you don't mind the dimensionality reduction that comes from hash collisions.

On Mon, Jul 20, 2015 at 3:21 AM, Peter Zvirinsky <peter.zvirin...@gmail.com> wrote:

> Hello,
>
> I'm trying to run LDA on a relatively large dataset (size 100-200 G), but
> with no luck so far.
>
> At first I made sure that the executors have enough memory with respect to
> the vocabulary size and number of topics.
>
> After that I ran LDA with default EMLDAOptimizer, but learning failed
> after a few iterations, because the application master ran out of disk. The
> learning job used all space available in the usercache of the application
> master (cca. 100G). I noticed that this implementation uses some sort of
> checkpointing so I made sure it is not used, but it didn't help.
>
> Afterwards, I tried the OnlineLDAOptimizer, but it started failing at
> "reduce at LDAOptimizer.scala:421" with error message: "Total size of
> serialized results of X tasks (Y GB) is bigger than
> spark.driver.maxResultSize (Y GB)". I kept increasing the
> spark.driver.maxResultSize to tens of GB but it didn't help, just delayed
> this error. I tried to adjust the batch size to very small values so that I
> was sure it must fit into memory, but this didn't help at all.
>
> Has anyone experience with learning LDA on such a dataset? Maybe some
> ideas what might be wrong?
>
> I'm using spark 1.4.0 in yarn-client mode. I managed to learn a word2vec
> model on the same dataset with no problems at all.
>
> Thanks,
> Peter