Re: How is Spark a memory based solution if it writes data to disk before shuffles?

krexos Sat, 02 Jul 2022 06:43:20 -0700

Don't stages by definition include a shuffle? If you didn't need a shuffle 
between two stages, you could merge them into one stage.
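
To make the point concrete, here is a toy Python sketch of the idea. It is not Spark's actual scheduler. It assumes a purely linear chain of operations and a hypothetical `WIDE` set that marks which operations always need a shuffle, which is a simplification because some of these can skip the shuffle when the data is already partitioned suitably. The helper names `split_into_stages` and `shuffle_writes` are made up for illustration.

```python
from typing import List

# Assumption for illustration: these operations are treated as always
# requiring a shuffle. Real Spark decides this from partitioner information.
WIDE = {"reduceByKey", "groupByKey", "repartition", "sortByKey"}


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Cut a linear chain of operations into stages at each wide operation."""
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            # A wide operation reads shuffled input, so it opens a new stage.
            stages.append([])
        # Narrow operations stay in the current stage (they are "merged").
        stages[-1].append(op)
    return stages


def shuffle_writes(stages: List[List[str]]) -> List[bool]:
    """In this model every stage except the last one ends by writing shuffle output."""
    return [i < len(stages) - 1 for i in range(len(stages))]


if __name__ == "__main__":
    word_count = ["textFile", "flatMap", "map", "reduceByKey",
                  "filter", "sortByKey", "collect"]
    stages = split_into_stages(word_count)
    for stage, writes in zip(stages, shuffle_writes(stages)):
        print(stage, "writes shuffle files" if writes else "no shuffle write")

    # With no wide operation the whole chain collapses into one stage.
    print(split_into_stages(["textFile", "map", "filter", "collect"]))
```

In this model the narrow operations between two shuffles always end up in the same stage, and only the last stage of a job finishes without a shuffle write. A job with no wide operation at all is a single stage that writes no shuffle files.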


thanks,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 4:13 PM, Sean Owen <sro...@gmail.com> wrote:

> Because only shuffle stages write shuffle files. Most stages are not shuffles.
>
> On Sat, Jul 2, 2022, 7:28 AM krexos <kre...@protonmail.com.invalid> wrote:
>
>> Hello,
>>
>> One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, 
>> which persists intermediate results of its computation to HDFS (disk), Spark 
>> keeps all its results in memory. I don't understand this, because in reality, 
>> when a Spark stage finishes, [it writes all of the data into shuffle files 
>> stored on the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>> How then is this an improvement on MapReduce?
>>
>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>
>> thanks!
