Re: Spark not releasing shuffle files in time (with very large heap)

naresh Goud Thu, 22 Feb 2018 18:58:35 -0800

It is difficult to say without knowing what your application code does
and which transformations and actions it performs. In my experience,
tuning application code so that it avoids creating unnecessary objects
reduces pressure on the GC.
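
As a rough illustration of the periodic GC option raised in the quoted
message, here is a minimal sketch of how the setting might be passed on
the command line. The "5min" interval and the /mnt/ssd/spark-tmp path are
hypothetical values, not recommendations, and the to_submit_args helper
is made up for this example. spark.cleaner.periodicGC.interval (default
30min) and spark.local.dir are existing Spark properties; whether
lowering the interval is enough for this workload is an assumption that
would need testing.

```python
# Hypothetical values: pick an interval and scratch path that suit your setup.
settings = {
    # How often Spark's context cleaner triggers a JVM GC (default: 30min).
    # A shorter interval lets unreferenced shuffle files be cleaned up sooner.
    "spark.cleaner.periodicGC.interval": "5min",
    # Scratch directory for shuffle files, in case /tmp is too small.
    # This relocates the files; it does not make Spark delete them earlier.
    "spark.local.dir": "/mnt/ssd/spark-tmp",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Print the flags so they can be pasted after `spark-submit`.
print(" ".join(to_submit_args(settings)))
```

The same key/value pairs could instead go into spark-defaults.conf or be
set on a SparkConf before the context is created.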



On Thu, Feb 22, 2018 at 2:13 AM, Keith Chapman <keithgchap...@gmail.com>
wrote:

> Hi,
>
> I'm benchmarking a Spark application by running it for multiple
> iterations. It's a benchmark that's heavy on shuffle, and I run it on a
> local machine with a very large heap (~200GB). The system has an SSD. When
> running for 3 to 4 iterations I run out of disk space
> on the /tmp directory. On further investigation I was able to figure out
> that the reason for this is that the shuffle files are still around:
> because I have a very large heap, GC has not happened and hence the shuffle
> files are not deleted. I was able to confirm this by lowering the heap size;
> GC then kicks in more often and the size of /tmp stays under
> control. Is there any way I could configure Spark to handle this issue?
>
> One option that I have is to have GC run more often by
> setting spark.cleaner.periodicGC.interval to a much lower value. Is there
> a cleaner solution?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
