Correct, brute-force cleanup is not useful. Since Spark 1.0, Spark can automatically clean up files for RDDs that are no longer used, i.e., garbage collected by the JVM. That is the best approach, but it depends on the JVM's GC characteristics: if old RDD references in the driver are not collected promptly, the corresponding files on the workers are not removed either. Forcing a GC periodically in the driver might help you get rid of files on the workers that are no longer needed.
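The mechanism described above can be illustrated with a small pure-Python analogy (the `CachedRDD` class and all names here are invented for illustration; Spark's ContextCleaner similarly reacts when a weakly-tracked RDD is garbage collected):

```python
import gc
import weakref

class CachedRDD:
    """Stand-in for an RDD whose backing shuffle files need cleanup."""
    def __init__(self, name, cleaned):
        self.name = name
        # Register a callback that fires once this object is collected --
        # analogous to Spark's ContextCleaner watching weak references.
        weakref.finalize(self, cleaned.append, name)

cleaned = []
rdd = CachedRDD("rdd-0", cleaned)
rdd = None    # drop the last strong reference, as when a loop rebinds a variable
gc.collect()  # force a collection, as suggested for the driver
print(cleaned)
```

The point is that cleanup is driven by reachability, not by time: nothing is removed until the reference actually dies and a collection runs, which is why forcing a GC in the driver can flush worker-side files sooner.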
TD

On Mon, Feb 16, 2015 at 12:27 AM, Antony Mayi <antonym...@yahoo.com.invalid> wrote:

> spark.cleaner.ttl is not the right way - it seems to be really designed for
> streaming. Although it keeps the disk usage under control, it also causes
> loss of RDDs and broadcasts that are required later, leading to a crash.
>
> Is there any other way?
> Thanks,
> Antony.
>
> On Sunday, 15 February 2015, 21:42, Antony Mayi <antonym...@yahoo.com> wrote:
>
>> spark.cleaner.ttl ?
>>
>> On Sunday, 15 February 2015, 18:23, Antony Mayi <antonym...@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> I am running a bigger ALS job on Spark 1.2.0 on YARN (CDH 5.3.0) - the ALS
>>> is using about 3 billion ratings and I am doing several trainImplicit()
>>> runs in a loop within one Spark session. I have a four-node cluster with
>>> 3 TB of disk space on each node. Before starting the job, less than 8% of
>>> the disk space is used. While the ALS is running I can see the disk usage
>>> rapidly growing, mainly because of files being stored under
>>> yarn/local/usercache/user/appcache/application_XXX_YYY/spark-local-ZZZ-AAA.
>>> After about 10 hours the disk usage hits 90% and YARN kills the affected
>>> containers.
>>>
>>> Am I missing some cleanup somewhere while looping over the several
>>> trainImplicit() calls? Taking 4*3 TB of disk space seems immense.
>>>
>>> Thanks for any help,
>>> Antony.
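Applied to the looped trainImplicit() calls in the quoted question, the advice might look roughly like the following PySpark sketch (NUM_RUNS, RANK, ratings, and evaluate are placeholder names, not from the thread; sc._jvm.System.gc() uses py4j to nudge the JVM-side GC, since the files are tracked there):

```python
import gc
from pyspark.mllib.recommendation import ALS

for run in range(NUM_RUNS):                    # NUM_RUNS: placeholder
    model = ALS.trainImplicit(ratings, RANK)   # ratings, RANK: placeholders
    evaluate(model)                            # hypothetical evaluation step
    del model                # drop the driver-side reference to the old model
    gc.collect()             # collect the Python wrapper ...
    sc._jvm.System.gc()      # ... and nudge the JVM so the cleaner can
                             # remove the spark-local-* files on the workers
```

This is a sketch, not a tested recipe: whether the spark-local directories shrink in time still depends on how promptly the JVM collects the dead references.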