Hi Udit,

The persisted RDDs in memory are cleared by Spark using an LRU policy, and
you can also set the time after which persisted RDDs and metadata are
cleared by setting *spark.cleaner.ttl* (default: infinite). But I am not
aware of any property to clean the older shuffle writes from disk.
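To picture what "cleared using an LRU policy" means, here is a toy sketch in plain Python. This is not Spark's actual code — the class, capacity, and the `rdd_*` keys are made up purely to illustrate the eviction order: the least recently *used* block goes first, not the oldest one.

```python
from collections import OrderedDict

# Toy model of LRU eviction, the policy Spark applies to cached RDD
# blocks when storage memory fills up. Illustration only, not Spark code.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order == recency order

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)          # refresh recency
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            evicted, _ = self.entries.popitem(last=False)  # least recent
            print(f"evicted {evicted}")

    def get(self, key):
        self.entries.move_to_end(key)              # access refreshes recency
        return self.entries[key]

cache = LRUCache(2)
cache.put("rdd_1", "partition data")
cache.put("rdd_2", "partition data")
cache.get("rdd_1")                    # rdd_1 is now the most recently used
cache.put("rdd_3", "partition data")  # evicts rdd_2, not rdd_1
```

Note that the `get` on `rdd_1` is what saves it: without that access, `rdd_1` would have been the eviction victim instead.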
thanks,
bijay

On Tue, Mar 31, 2015 at 1:50 PM, Udit Mehta <ume...@groupon.com> wrote:

> Thanks for the reply.
> This will reduce the shuffle write to disk to an extent, but for a
> long-running job (multiple days) the shuffle write would still occupy a
> lot of space on disk. Why do we need to store the data from older map
> tasks in memory?
>
> On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak <bijay.pat...@cloudwick.com>
> wrote:
>
>> The Spark sort-based shuffle (default from 1.1) keeps the data from
>> each map task in memory until it can no longer fit, after which it is
>> sorted and spilled to disk. You can reduce the shuffle write to disk
>> by increasing spark.shuffle.memoryFraction (default 0.2).
>>
>> By writing the shuffle output to disk, the Spark lineage can be
>> truncated when the RDDs are already materialized as side effects of an
>> earlier shuffle. This is an under-the-hood optimization in Spark which
>> is only possible because the shuffle output is written to disk.
>>
>> You can set spark.shuffle.spill to false if you don't want to spill to
>> disk, assuming you have enough heap memory.
>>
>> On Tue, Mar 31, 2015 at 12:35 PM, Udit Mehta <ume...@groupon.com> wrote:
>>> I have noticed a similar issue when using Spark Streaming. The Spark
>>> shuffle write size increases to a large size (in GB) and then the app
>>> crashes, saying:
>>> java.io.FileNotFoundException:
>>> /data/vol0/nodemanager/usercache/$user/appcache/application_1427480955913_0339/spark-local-20150330231234-db1a/0b/temp_shuffle_1b23808f-f285-40b2-bec7-1c6790050d7f
>>> (No such file or directory)
>>>
>>> I don't understand why the shuffle size increases to such a large
>>> value for long-running jobs.
>>>
>>> Thanks,
>>> Udit
>>>
>>> On Mon, Mar 30, 2015 at 4:01 AM, shahab <shahab.mok...@gmail.com> wrote:
>>>>
>>>> Thanks Saisai. I will try your solution, but I still don't understand
>>>> why the filesystem should be used when there is plenty of memory
>>>> available!
>>>>
>>>> On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao <sai.sai.s...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Shuffle write will finally spill the data to the file system as a
>>>>> bunch of files. If you want to avoid disk writes, you can mount a
>>>>> ramdisk and point "spark.local.dir" at it. Shuffle output will then
>>>>> be written to a memory-based FS and will not introduce disk IO.
>>>>>
>>>>> Thanks
>>>>> Jerry
>>>>>
>>>>> 2015-03-30 17:15 GMT+08:00 shahab <shahab.mok...@gmail.com>:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I was looking at the Spark UI, Executors tab, and I noticed that I
>>>>>> have 597 MB of Shuffle Write while I am using a cached temp table
>>>>>> and Spark had 2 GB of free memory (the number under Memory Used is
>>>>>> 597 MB / 2.6 GB)?!
>>>>>>
>>>>>> Shouldn't the Shuffle Write be zero, with all (map/reduce) tasks
>>>>>> done in memory?
>>>>>>
>>>>>> best,
>>>>>>
>>>>>> /Shahab
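For reference, the settings discussed in the thread (spark.shuffle.memoryFraction, spark.shuffle.spill, spark.local.dir) could be wired up like this when building the context. A minimal PySpark sketch using Spark 1.x-era property names; the values and the /mnt/ramdisk path are placeholders — you would have to mount a tmpfs there yourself and size everything for your own job:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-tuning-example")
        # Give map-side buffers more of the heap before they spill
        # (default 0.2); 0.4 is an arbitrary example value.
        .set("spark.shuffle.memoryFraction", "0.4")
        # Disable spilling entirely -- only safe if the shuffle data
        # actually fits in the heap.
        .set("spark.shuffle.spill", "false")
        # Point local/shuffle files at a ramdisk to avoid disk IO.
        # /mnt/ramdisk is a placeholder: mount a tmpfs there first, e.g.
        #   mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
        .set("spark.local.dir", "/mnt/ramdisk"))

sc = SparkContext(conf=conf)
```

As Bijay notes above, disabling spilling trades robustness for speed: with spark.shuffle.spill off, a shuffle that outgrows the heap fails with an OOM instead of degrading gracefully to disk.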