On 29 Oct 2013, at 02:47, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yes, we still write out data after these tasks in Spark 0.8, and it needs
> to be written out before any stage that reads it can start. The main
> reason is simplicity when there are faults, as well as more flexible
> scheduling (you don't have to decide where each reduce task is in
> advance, you can have more reduce tasks than you have CPU cores, etc.).
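Thank you for the answer! To check my understanding with a concrete sketch (the master URL, app name, and input path below are placeholders I made up): reduceByKey closes the map stage, so its output is written out in full before any reduce task can start.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits for pair-RDD operations such as reduceByKey

    val sc = new SparkContext("local[4]", "shuffle-sketch")
    val counts = sc.textFile("hdfs:///tmp/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // stage boundary: the map output is written out here
    println(counts.count())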
I have a follow-up: in which fraction of the heap (the RDD storage fraction or the non-RDD remainder) will this output be stored before spilling to disk? I have a job where I read over a large data set once and don't persist anything. Would it make sense to set "spark.storage.memoryFraction" to a smaller value in order to avoid spilling to disk? Concretely, I mean something like the sketch below.
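If I have it right, in 0.8 this is set as a Java system property before the SparkContext is created (0.2 is just an example value, not a recommendation):

    import org.apache.spark.SparkContext

    // Assumption: in 0.8 this is read from Java system properties at
    // SparkContext creation time, so it must be set before "new SparkContext".
    System.setProperty("spark.storage.memoryFraction", "0.2")  // smaller than the default,
                                                               // leaving more heap for non-RDD use
    val sc = new SparkContext("local[4]", "scan-once-job")     // placeholder master/app name

- Ufuk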