Hi,
This has already been briefly discussed here in the past, but there seem to be
more open questions...
I am running a large ALS job with ~40GB of input data (~3 billion ratings).
The data is partitioned into 512 partitions and default parallelism is also
set to 512. ALS runs with rank=100 and 15 iterations, on Spark 1.2.0.
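For reference, the setup looks roughly like this (the path, the parsing, and
the lambda value are illustrative, not my exact code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val conf = new SparkConf()
      .setAppName("als-job")
      .set("spark.default.parallelism", "512")
    val sc = new SparkContext(conf)

    // ~40GB of input, ~3 billion ratings; file layout is illustrative
    val ratings = sc.textFile("hdfs:///data/ratings")
      .map { line =>
        val Array(user, item, score) = line.split(',')
        Rating(user.toInt, item.toInt, score.toDouble)
      }
      .repartition(512)                           // 512 partitions

    // rank=100, 15 iterations; lambda=0.01 is just a placeholder here
    val model = ALS.train(ratings, 100, 15, 0.01)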
The issue is the volume of temporary data written to disk during processing.
You can see the effect here: http://picpaste.com/disk-UKGFOlte.png
It accumulates 12TB(!) of data until it reaches the 90% disk-usage threshold,
at which point YARN kills the job.
I have the checkpoint directory set, so the temporary data should supposedly
be cleaned up, but I am not sure that is happening (although one drop is
visible in the graph).
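For clarity, the checkpoint directory is set along these lines (the path is
illustrative):

    // As I understand it, ALS checkpoints its intermediate RDDs every few
    // iterations once this is set, which should allow the accumulated
    // shuffle files to be cleaned up.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")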
Is there any solution for this? 12TB of temporary data never being cleaned up
seems wrong.
Thanks,
Antony.
