Yes, shuffle data will eventually be written to disk for the reduce stage to
pull, no matter how large you set the shuffle memory fraction.
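
For illustration, below is a minimal sketch (Scala, Spark 1.x; the input and
output paths are hypothetical) of a WordCount job whose reduceByKey triggers a
shuffle. Raising spark.shuffle.memoryFraction only reduces spilling of the
intermediate aggregation buffers; the final map-side shuffle output is still
written to local disk files for reducers to fetch.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          // Spark 1.x setting: fraction of heap used for shuffle
          // aggregation buffers. A larger value reduces spilling, but
          // does not keep the final shuffle output in memory.
          .set("spark.shuffle.memoryFraction", "0.4")
        val sc = new SparkContext(conf)

        // "hdfs:///data/input.txt" is a placeholder path.
        sc.textFile("hdfs:///data/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _) // shuffle boundary: map output goes to local disk here
          .saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }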

Thanks
Saisai

On Thu, Aug 6, 2015 at 7:50 AM, Muler <mulugeta.abe...@gmail.com> wrote:

> thanks, so if I have large enough memory (with enough
> spark.shuffle.memory), then shuffle spill doesn't happen (per node, the
> shuffle stays in memory), but shuffle data still has to be written to disk
> eventually so that the reduce stage can pull it across the network?
>
> On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> Hi Muler,
>>
>> Shuffle data will be written to disk no matter how much memory you
>> have. Large memory can alleviate shuffle spill, where temporary files are
>> generated when memory is not enough.
>>
>> Yes, each node writes its shuffle data to file, and in the reduce stage
>> it is pulled from disk through the network framework (the default is
>> Netty).
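>>
>> As a rough illustration (Spark 1.x property names; the values are just
>> examples, not recommendations), the spill and fetch behavior is governed
>> by settings such as:
>>
>>     // In SparkConf, or as --conf flags to spark-submit:
>>     conf.set("spark.shuffle.memoryFraction", "0.2")         // heap fraction for shuffle aggregation buffers
>>     conf.set("spark.shuffle.spill", "true")                 // spill to temporary files when buffers overflow
>>     conf.set("spark.shuffle.blockTransferService", "netty") // transport reducers use to fetch shuffle blocks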
>>
>> Thanks
>> Saisai
>>
>> On Thu, Aug 6, 2015 at 7:10 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Consider I'm running WordCount with 100m of data on a 4-node cluster,
>>> assuming my RAM size on each node is 200g and I'm giving my executors 100g
>>> (just enough memory for the 100m of data).
>>>
>>>
>>>    1. If I have enough memory, can Spark 100% avoid writing to disk?
>>>    2. During shuffle, where results have to be collected from nodes,
>>>    does each node write to disk and then the results are pulled from
>>>    disk? If not, what is the API that is being used to pull data from
>>>    nodes across the cluster? (I'm thinking what Scala or Java packages
>>>    would allow you to read in-memory data from other machines?)
>>>
>>> Thanks,
>>>
>>
>>
>
