Hi,
I am running a large ALS job on Spark 1.2.0 on YARN (CDH 5.3.0). The ALS input is about
3 billion ratings, and I am doing several trainImplicit() runs in a loop
within one Spark session. I have a four-node cluster with 3 TB of disk space on each node.
Before the job starts, less than 8% of the disk space is used. While the
ALS is running, I can see the disk usage growing rapidly, mainly because of files
being stored under
yarn/local/usercache/user/appcache/application_XXX_YYY/spark-local-ZZZ-AAA.
After about 10 hours the disk usage hits 90% and YARN kills the
containers.
Am I missing some cleanup step while looping over the several
trainImplicit() calls? Consuming 4 x 3 TB of disk space seems immense.
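For reference, the loop looks roughly like this (a sketch only; the parameter values and names are illustrative, assuming MLlib's ALS from org.apache.spark.mllib.recommendation):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// ratings: RDD[Rating] with roughly 3 billion implicit-feedback entries
def runSweep(ratings: RDD[Rating]): Unit = {
  for (rank <- Seq(10, 20, 50); lambda <- Seq(0.01, 0.1)) {
    val model = ALS.trainImplicit(ratings, rank, 10, lambda, 40.0)
    // ... evaluate the model here ...
    // Unpersisting the factor RDDs frees executor memory, but the
    // spark-local-* shuffle files on disk keep accumulating between runs.
    model.userFeatures.unpersist()
    model.productFeatures.unpersist()
  }
}
```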
Thanks for any help,
Antony.