I think I understand where Spark saves IO.

In MR we have map -> reduce -> map -> reduce -> map -> reduce ...,
which writes results to disk at the end of each such "arrow". In Spark, on the
other hand, we have map -> reduce + map -> reduce + map -> reduce ..., where
each reduce runs together with the following map, so there are about half as
many arrows and therefore about half the IO.

thanks everyone,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com> wrote:

> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce
> that persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory. I don't understand this as in reality when a
> Spark stage finishes [it writes all of the data into shuffle files stored on
> the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
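
To make the arrow counting from the reply at the top concrete, here is a minimal Python sketch. It assumes the simplified model used there: every "->" costs exactly one write to disk, and nothing else is counted (no final output write, no spills, no caching). The function names `mr_pipeline`, `spark_pipeline` and `disk_writes` are made up for illustration and are not part of Spark or Hadoop.

```python
def mr_pipeline(rounds: int) -> str:
    """Chain of MR jobs: map -> reduce -> map -> reduce ..."""
    return " -> ".join(["map", "reduce"] * rounds)


def spark_pipeline(rounds: int) -> str:
    """Same work in Spark: each reduce is fused with the next map."""
    stages = ["map"] + ["reduce + map"] * (rounds - 1) + ["reduce"]
    return " -> ".join(stages)


def disk_writes(pipeline: str) -> int:
    """Simplifying assumption: one disk write per arrow, nothing else."""
    return pipeline.count("->")


if __name__ == "__main__":
    for rounds in (3, 10, 100):
        mr = disk_writes(mr_pipeline(rounds))
        spark = disk_writes(spark_pipeline(rounds))
        print(f"rounds={rounds}: MR={mr} Spark={spark} ratio={mr / spark:.2f}")
```

Under this model a chain of n rounds has 2n - 1 arrows in MR and n arrows in Spark, so the ratio approaches 2 as the chain gets longer.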