Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
I have noticed a similar issue when using Spark Streaming. The Spark shuffle write size grows very large (multiple GB) and then the app crashes with: java.io.FileNotFoundException: ...

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Udit Mehta
Thanks for the reply. This will reduce the shuffle write to disk to an extent, but for a long-running job (multiple days) the shuffle write would still occupy a lot of space on disk. Why do we need to keep the data from older map tasks in memory? On Tue, Mar 31, 2015 at 1:19 PM, Bijay Pathak ...

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
The Spark sort-based shuffle (the default since 1.1) keeps the data from each map task in memory until it no longer fits, after which it is sorted and spilled to disk. You can reduce the shuffle write to disk by increasing spark.shuffle.memoryFraction (default 0.2). By writing the shuffle output ...
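A minimal sketch of that tuning on Spark 1.x; the values are illustrative, and note that spark.shuffle.memoryFraction shares executor memory with spark.storage.memoryFraction (default 0.6), which covers cached RDDs, so raising one may require lowering the other:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give the shuffle more room in memory before it spills to disk.
    // 0.4 is an illustrative value; the Spark 1.x default is 0.2.
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.shuffle.memoryFraction", "0.4")
      // Lowered from the 0.6 default to make room for the shuffle fraction.
      .set("spark.storage.memoryFraction", "0.5")

    val sc = new SparkContext(conf)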

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-31 Thread Bijay Pathak
Hi Udit, Persisted RDDs in memory are cleared by Spark using an LRU policy, and you can also set the time after which persisted RDDs and metadata are cleared by setting spark.cleaner.ttl (default: infinite). But I am not aware of any property to clean older shuffle writes from disk. Thanks,
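For reference, a sketch of setting that cleaner on Spark 1.x; the one-hour TTL is illustrative, and as noted above it clears persisted RDDs and metadata, not old shuffle files:

    import org.apache.spark.{SparkConf, SparkContext}

    // Forget metadata and persisted RDDs older than one hour (value in seconds).
    // The default is infinite (no time-based cleanup). Use with care on
    // long-running jobs: data still needed after the TTL will be lost.
    val conf = new SparkConf()
      .setAppName("cleaner-ttl-sketch")
      .set("spark.cleaner.ttl", "3600")

    val sc = new SparkContext(conf)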

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread Saisai Shao
Shuffle write ultimately spills the data to the file system as a set of files. If you want to avoid disk writes, you can mount a ramdisk and point spark.local.dir at it. The shuffle output will then be written to a memory-based FS and will not introduce disk IO. Thanks, Jerry 2015-03-30 17:15 ...
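A sketch of that setup, assuming a tmpfs mounted at /mnt/ramdisk on each worker (the mount point and size are illustrative):

    // Shell, run once on each worker as root (assumed mount point/size):
    //   mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk
    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's scratch space (shuffle output, spill files) at the
    // memory-backed mount so shuffle writes never touch physical disk.
    val conf = new SparkConf()
      .setAppName("ramdisk-shuffle-sketch")
      .set("spark.local.dir", "/mnt/ramdisk")

    val sc = new SparkContext(conf)

Note that on a cluster manager, spark.local.dir may be overridden by the SPARK_LOCAL_DIRS environment variable or by the cluster manager's own local-directory setting, so the mount must be configured wherever that value comes from.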

why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Hi, I was looking at the Spark UI, Executors tab, and noticed 597 MB of Shuffle Write even though I am using a cached temp table and Spark had 2 GB of free memory (the number under Memory Used is 597 MB / 2.6 GB)?!!! Shouldn't Shuffle Write be zero, with all (map/reduce) tasks done in memory? ...
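For context, a minimal sketch of the kind of job being described (the table name and query are hypothetical). Caching only avoids re-reading the input; any wide operation such as a GROUP BY still repartitions data across executors, and that map-side shuffle output is written to local files regardless of free memory:

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc` and a registered temp table "events".
    val sqlContext = new SQLContext(sc)

    // Caching keeps the *input* rows in memory...
    sqlContext.cacheTable("events")

    // ...but the GROUP BY is a wide dependency, so its map output is still
    // written as shuffle files before the reduce side reads it, which is
    // what shows up under Shuffle Write in the UI.
    val counts = sqlContext.sql(
      "SELECT user_id, COUNT(*) FROM events GROUP BY user_id")
    counts.collect()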

Re: why Shuffle Write is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
Thanks Saisai. I will try your solution, but I still don't understand why the file system should be used when there is plenty of memory available! On Mon, Mar 30, 2015 at 11:22 AM, Saisai Shao sai.sai.s...@gmail.com wrote: Shuffle write will finally spill the data into file system as a bunch of ...