...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Long-running job cleanup
Hi Patrick, to follow up on the discussion below, I am including a short code
snippet that reproduces the problem on 1.1. This is kind of stupid
What do you mean when you say the overhead of Spark shuffles starts to
accumulate? Could you elaborate more?
In newer versions of Spark, shuffle data is cleaned up automatically
when an RDD goes out of scope. It is safe to remove shuffle data at
that point because the RDD can no longer be referenced.
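For readers unfamiliar with the mechanism: Spark's cleaner tracks RDDs with weak references and removes the associated shuffle state once the RDD object is garbage-collected. A minimal Python analogy of that reference-tracking idea follows; `FakeRDD`, `track`, and the `cleaned` registry are illustrative names, not Spark APIs.

```python
import gc
import weakref

# Registry of cleaned "shuffle" ids, standing in for the files and
# metadata Spark reclaims once an RDD can no longer be referenced.
cleaned = []

class FakeRDD:
    def __init__(self, rdd_id):
        self.rdd_id = rdd_id

def track(rdd):
    # When the FakeRDD is garbage-collected, record that its
    # shuffle data was cleaned (mimics weak-reference tracking).
    weakref.finalize(rdd, cleaned.append, rdd.rdd_id)

rdd = FakeRDD(42)
track(rdd)
print(cleaned)   # [] -- still strongly referenced, nothing cleaned yet

rdd = None       # drop the last strong reference
gc.collect()     # force a collection pass
print(cleaned)   # [42] -- out of scope, so its shuffle state is reclaimed
```

The practical consequence in an iterative job is that shuffle state for an RDD can only be reclaimed once your program holds no remaining reference to it.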
Hi Patrick - is that cleanup present in 1.1?
The overhead I am talking about is with regard to what I believe is
shuffle-related metadata. If I watch the execution log, I see small
broadcast variables created for every stage of execution, a few KB at a
time, and a certain number of MB remaining
Hello all - can anyone please offer any advice on this issue?
-Ilya Ganelin
On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:
Hi all, I have a long-running job iterating over a huge dataset. Parts of this
operation are cached. Since the job runs for so long, eventually the overhead
of Spark shuffles starts to accumulate, culminating in the driver starting to
swap.
I am aware of the spark.cleaner.ttl parameter that
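For reference, the time-based cleaner mentioned here is enabled by setting `spark.cleaner.ttl` (in seconds) on the job's configuration. A hedged sketch assuming PySpark 1.x; the app name and TTL value are illustrative, and the setting periodically discards old metadata, so it is only safe when cached data older than the TTL is no longer needed:

```python
from pyspark import SparkConf, SparkContext

# Configuration sketch: discard metadata and shuffle state
# older than one hour (3600 seconds).
conf = (SparkConf()
        .setAppName("long-running-job")
        .set("spark.cleaner.ttl", "3600"))
sc = SparkContext(conf=conf)
```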