Re: How is Spark a memory based solution if it writes data to disk before shuffles?

krexos Sat, 02 Jul 2022 07:15:14 -0700

I think I understand where Spark saves IO.

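In MR we have

map -> reduce -> map -> reduce -> map -> reduce ...

which writes results to disk at the end of each such "arrow".

In Spark, on the other hand, we have

map -> reduce + map -> reduce + map -> reduce ...

where each reduce runs together with the map that follows it, so results are written to disk only at the map -> reduce arrows. That is about half as many disk writes, i.e. roughly a 2x saving in IO.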
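
To sanity-check the "about 2 times" figure, here is a small Python sketch that just counts the arrows in the two diagrams above. It is a toy model under my own assumptions: every arrow costs exactly one write to disk, all writes are the same size, and the helper names (mr_pipeline, spark_pipeline, disk_writes) are made up for illustration and are not Spark or Hadoop APIs.

```python
def mr_pipeline(rounds: int) -> str:
    """Chain of MapReduce jobs: every step is separated by an arrow."""
    return " -> ".join(["map", "reduce"] * rounds)


def spark_pipeline(rounds: int) -> str:
    """Same work in Spark: each reduce is fused ("+") with the next map."""
    return " + ".join(["map -> reduce"] * rounds)


def disk_writes(pipeline: str) -> int:
    """Toy model (assumption): every arrow is one write of results to disk."""
    return pipeline.count("->")


if __name__ == "__main__":
    for rounds in (3, 10, 100):
        mr = disk_writes(mr_pipeline(rounds))
        spark = disk_writes(spark_pipeline(rounds))
        print(f"rounds={rounds}: MR={mr}, Spark={spark}, ratio={mr / spark:.2f}")
```

For n map/reduce rounds this gives 2n - 1 arrows for MR and n for Spark, so the ratio approaches 2 as the chain gets longer. It ignores the write of the final output and any difference in data volume per write.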

thanks everyone,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce
> that persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory. I don't understand this as in reality when a
> Spark stage finishes [it writes all of the data into shuffle files stored on
> the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
