Did you try using a smaller number of partitions (user/product blocks)? Did you use implicit feedback? In the current implementation, we only do checkpointing with implicit feedback. We should adopt for ALS the checkpointing strategy implemented in LDA's PeriodicGraphCheckpointer:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala
Could you try the latest branch-1.3 or master and see whether it helps?

-Xiangrui
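P.S. A rough sketch of what I mean, against the 1.2/1.3 mllib API (the checkpoint path and the block count of 256 are placeholders, and sc / ratings are assumed to already be in scope):

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  // A reliable checkpoint dir (e.g. on HDFS) must be set before training;
  // without it the lineage is never truncated and temp data keeps piling up.
  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // placeholder path

  val model = new ALS()
    .setRank(100)
    .setIterations(15)
    .setBlocks(256)           // fewer user/product blocks than the current 512
    .setImplicitPrefs(true)   // in 1.2.0 only the implicit path checkpoints
    .run(ratings)             // ratings: RDD[Rating], assumed in scope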
On Mon, Feb 23, 2015 at 6:21 AM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:
> Hi,
>
> This has already been briefly discussed here in the past, but there seem
> to be more questions...
>
> I am running a bigger ALS task with ~40 GB of input data (~3 billion
> ratings). The data is partitioned into 512 partitions and I am also using
> default parallelism set to 512. The ALS runs with rank=100, iters=15,
> using Spark 1.2.0.
>
> The issue is the volume of temporary data written to disk during
> processing. You can see the effect here:
> http://picpaste.com/disk-UKGFOlte.png
> It stores 12 TB of data until it reaches the 90% threshold, at which
> point YARN kills it.
>
> I have the checkpoint directory set, so it should allegedly be clearing
> the temp data, but I am not sure that is happening (although one drop is
> visible).
>
> Is there any solution for this? 12 TB of temp data not getting cleaned
> up seems wrong.
>
> Thanks,
> Antony.
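(For reference, the setup described above corresponds roughly to this sketch. Only the partition count, default parallelism, rank, and iteration count come from the message; the paths, app name, and input format are placeholders.)

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val conf = new SparkConf()
    .setAppName("als")
    .set("spark.default.parallelism", "512")      // as reported
  val sc = new SparkContext(conf)
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // "checkpoint directory set"

  // ~40 GB of ratings (~3 billion) in 512 partitions; input format is assumed
  val ratings = sc.objectFile[Rating]("hdfs:///data/ratings").repartition(512)

  val model = ALS.train(ratings, 100, 15)         // rank=100, iters=15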