Re: Long-running job cleanup

2014-12-31 Thread Ganelin, Ilya
Hi Patrick, to follow up on the below discussion, I am including a short code snippet that produce

Re: Long-running job cleanup

2014-12-30 Thread Ganelin, Ilya
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see sma

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining of

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say "the overhead of spark shuffles start to accumulate"? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenc
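The mechanism described here - shuffle data removed once the RDD can no longer be referenced - is driven by Spark's ContextCleaner, which tracks RDDs via weak references and cleans up their shuffle output when they are garbage-collected. A minimal Python sketch of that pattern (toy names, not Spark's actual API):

```python
import gc
import weakref

class ShuffleStore:
    """Stands in for the block manager's on-disk shuffle files."""
    def __init__(self):
        self.files = {}

    def write(self, shuffle_id, data):
        self.files[shuffle_id] = data

    def remove(self, shuffle_id):
        self.files.pop(shuffle_id, None)

class RDD:
    """Toy RDD that owns one shuffle output."""
    def __init__(self, shuffle_id, store):
        self.shuffle_id = shuffle_id
        store.write(shuffle_id, b"shuffle bytes")
        # When this RDD is garbage-collected, drop its shuffle files,
        # mirroring how the ContextCleaner reacts to a weak reference expiring.
        weakref.finalize(self, store.remove, shuffle_id)

store = ShuffleStore()
rdd = RDD(1, store)
assert 1 in store.files      # shuffle data exists while the RDD is referenced

rdd = None                   # the RDD goes out of scope
gc.collect()                 # make collection deterministic for the demo
assert 1 not in store.files  # its shuffle data was cleaned up automatically
```

The practical consequence for a long-running job is that shuffle files linger only as long as the driver holds a reference to the RDD (or a lineage chain leading to it), so dropping references - or calling `unpersist()` on cached RDDs - is what lets this cleanup fire.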

Re: Long-running job cleanup

2014-12-25 Thread Ilya Ganelin
Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya wrote: > Hi all, I have a long running job iterating over a huge dataset. Parts of > this operation are cached. Since the job runs for so long, eventually the > overhead of

Long-running job cleanup

2014-12-22 Thread Ganelin, Ilya
Hi all, I have a long-running job iterating over a huge dataset. Parts of this operation are cached. Since the job runs for so long, eventually the overhead of Spark shuffles starts to accumulate, culminating in the driver starting to swap. I am aware of the spark.cleaner.ttl parameter that allo
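For reference, the TTL-based cleaner mentioned above is enabled through a single configuration property; in Spark 1.x it can be set in spark-defaults.conf or on the SparkConf. A sketch (the value is illustrative, not a recommendation):

```
# spark-defaults.conf -- hypothetical value; anything persisted or shuffled
# longer ago than the TTL (in seconds) becomes eligible for periodic cleanup
spark.cleaner.ttl  3600
```

Note the caveat that comes with the TTL approach: data older than the TTL is cleaned even if it is still needed, so jobs that reuse a cached RDD across the whole run may prefer explicit rdd.unpersist() calls over a blanket TTL.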