On 29 Oct 2013, at 02:47, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Yes, we still write out data after these tasks in Spark 0.8, and it needs
> to be written out before any stage that reads it can start. The main
> reason is simplicity when there are faults, as well as more flexible
> scheduling (you don't have to decide where each reduce task is in
> advance, you can have more reduce tasks than you have CPU cores, etc.).
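Thank you for the answer! To check my understanding with a concrete sketch (the master URL, app name, and input path below are placeholders I made up): reduceByKey closes the map stage, so its output is written out in full before any reduce task can start.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits for pair-RDD operations such as reduceByKey

    val sc = new SparkContext("local[4]", "shuffle-sketch")
    val counts = sc.textFile("hdfs:///tmp/input")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // stage boundary: the map output is written out here
    println(counts.count())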
I have a follow-up: in which fraction of the heap (the RDD storage fraction or the non-RDD remainder) will this output be stored before spilling to disk? I have a job where I read over a large data set once and don't persist anything. Would it make sense to set "spark.storage.memoryFraction" to a smaller value in order to avoid spilling to disk? Concretely, I mean something like the sketch below.
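If I have it right, in 0.8 this is set as a Java system property before the SparkContext is created (0.2 is just an example value, not a recommendation):

    import org.apache.spark.SparkContext

    // Assumption: in 0.8 this is read from Java system properties at
    // SparkContext creation time, so it must be set before "new SparkContext".
    System.setProperty("spark.storage.memoryFraction", "0.2")  // smaller than the default,
                                                               // leaving more heap for non-RDD use
    val sc = new SparkContext("local[4]", "scan-once-job")     // placeholder master/app name

- Ufuk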