Thanks. So if I have large enough memory (with a large enough
spark.shuffle.memoryFraction), then shuffle spill doesn't happen on a node,
but the shuffle data still ultimately has to be written to disk so that the
reduce stage can pull it across the network?
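
For context, here is a minimal Scala sketch of the configuration being
discussed, using the Spark 1.x setting names; the specific values are
illustrative assumptions, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Large executor heap, and a larger fraction of it reserved for the
    // shuffle's in-memory buffers before they spill to disk.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.executor.memory", "100g")        // heap per executor (illustrative)
      .set("spark.shuffle.memoryFraction", "0.4")  // default is 0.2 in Spark 1.x
      .set("spark.shuffle.spill", "true")          // spill to disk when the buffer fills
    val sc = new SparkContext(conf)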

On Wed, Aug 5, 2015 at 4:40 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> Hi Muler,
>
> Shuffle data will be written to disk no matter how much memory you have;
> large memory can alleviate shuffle spill, where temporary files would
> otherwise be generated when memory is not enough.
>
> Yes, each node writes its shuffle data to local files, and in the reduce
> stage that data is pulled from disk over the network framework (Netty by
> default).
>
> Thanks
> Saisai
>
> On Thu, Aug 6, 2015 at 7:10 AM, Muler <mulugeta.abe...@gmail.com> wrote:
>
>> Hi,
>>
>> Consider I'm running WordCount with 100m of data on a 4-node cluster,
>> assuming the RAM size on each node is 200g and I'm giving my executors 100g
>> (enough memory for 100m of data):
>>
>>
>>    1. If I have enough memory, can Spark 100% avoid writing to disk?
>>    2. During shuffle, where results have to be collected from nodes, does
>>    each node write to disk, and are the results then pulled from disk? If
>>    not, what API is used to pull data from nodes across the cluster? (I'm
>>    wondering which Scala or Java packages would allow you to read in-memory
>>    data from other machines.)
>>
>> Thanks,
>>
>
>
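
For reference, a minimal Scala sketch of the word-count job being described
(the input and output paths are hypothetical). The reduceByKey step is what
triggers the shuffle: each node writes its map-side shuffle output to local
disk, and the reduce tasks then fetch those blocks over the network (Netty by
default), however much memory is available.

    // Hypothetical paths; the shuffle happens at reduceByKey.
    val counts = sc.textFile("hdfs:///input/words.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)    // map-side shuffle files are written to local disk here
    counts.saveAsTextFile("hdfs:///output/wordcounts")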
