I have explained the same thing in layman's terms. Go through it once.
On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:

> I think I understand where Spark saves IO.
>
> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>
> which writes results to disk at the end of each such "arrow",
>
> on the other hand in spark we have
>
> map -> reduce + map -> reduce + map -> reduce ...
>
> which saves about 2 times the IO
>
> thanks everyone,
> krexos
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com>
> wrote:
>
> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce
> that persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory. I don't understand this as in reality when
> a Spark stage finishes it writes all of the data into shuffle files
> stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
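
To put rough numbers on the "arrow" counting in the quoted message, here is a minimal Python sketch. It is a simplified model and not how either engine actually accounts for IO. It assumes that each MapReduce job writes to disk once after the map phase (shuffle output) and once after the reduce phase (job output to HDFS), that Spark writes shuffle files once per shuffle boundary plus one final output write, and it ignores spills, caching and read IO. The function names are made up for illustration.

```python
def mr_disk_writes(rounds: int) -> int:
    """Disk writes for a chain of `rounds` MapReduce jobs (simplified model)."""
    # Assumption: every job writes map output for the shuffle, then writes
    # the reduce output to HDFS so the next job's map phase can read it.
    return 2 * rounds


def spark_disk_writes(rounds: int) -> int:
    """Disk writes for the same chain run as a single Spark job (simplified model)."""
    # Assumption: each shuffle boundary writes shuffle files once; the reduce
    # side of one stage is fused with the next map, so nothing is written
    # between them. The final result is saved once at the end.
    return rounds + 1


if __name__ == "__main__":
    for rounds in (1, 3, 10, 100):
        mr = mr_disk_writes(rounds)
        spark = spark_disk_writes(rounds)
        print(f"rounds={rounds:3d}  MR writes={mr:3d}  Spark writes={spark:3d}  ratio={mr / spark:.2f}")
```

Under these assumptions three rounds give 6 writes for MR against 4 for Spark, and the ratio approaches 2 as the chain gets longer, which is consistent with the "about 2 times" estimate above.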