The reduce at LDAOptimizer.scala:421 collects a numTopics x vocabSize
matrix of summary statistics to the driver. I suspect this is what's
causing the failure.

One thing you could try is decreasing the vocabulary size. One option
is HashingTF, if you don't mind dimension reduction via hashing
collisions; a quick sketch is below.
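Roughly, something like this (an untested sketch; the input path,
feature count, and topic count are placeholders you'd replace for your
data):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("hashed-lda"))

    // Tokenized documents; tokenize however fits your corpus.
    val docs: RDD[Seq[String]] = sc.textFile("hdfs:///path/to/corpus")
      .map(_.toLowerCase.split("\\s+").toSeq)

    // numFeatures caps the effective vocabulary size. The summary
    // statistics collected to the driver are numTopics x vocabSize, so
    // shrinking this directly shrinks what the reduce at
    // LDAOptimizer.scala:421 has to pull back.
    val tf: RDD[Vector] = new HashingTF(numFeatures = 1 << 18).transform(docs)

    // LDA expects (docId, termCountVector) pairs.
    val corpus = tf.zipWithIndex().map { case (vec, id) => (id, vec) }.cache()

    val model = new LDA()
      .setK(100)                // placeholder topic count
      .setOptimizer("online")
      .run(corpus)

The tradeoff is that distinct terms sharing a hash bucket get merged,
so the learned topic-term weights are over buckets rather than words.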

On Mon, Jul 20, 2015 at 3:21 AM, Peter Zvirinsky <peter.zvirin...@gmail.com>
wrote:

> Hello,
>
> I'm trying to run LDA on a relatively large dataset (100-200 GB), but
> with no luck so far.
>
> At first I made sure that the executors have enough memory with respect to
> the vocabulary size and number of topics.
>
> After that I ran LDA with the default EMLDAOptimizer, but learning
> failed after a few iterations because the application master ran out
> of disk space. The learning job used all of the space available in the
> application master's usercache (approx. 100 GB). I noticed that this
> implementation uses some sort of checkpointing, so I made sure it was
> not used, but that didn't help.
>
> Afterwards, I tried the OnlineLDAOptimizer, but it started failing at
> "reduce at LDAOptimizer.scala:421" with the error message: "Total size
> of serialized results of X tasks (Y GB) is bigger than
> spark.driver.maxResultSize (Y GB)". I kept increasing
> spark.driver.maxResultSize to tens of GB, but that only delayed the
> error. I also tried setting the batch size to very small values, so
> that I was sure it would fit in memory, but that didn't help at all.
>
> Does anyone have experience training LDA on a dataset of this size?
> Any ideas about what might be wrong?
>
> I'm using Spark 1.4.0 in yarn-client mode. I managed to train a
> word2vec model on the same dataset with no problems at all.
>
> Thanks,
> Peter
>
