I have explained the same thing in layman's terms. Please go through it once.

On Sat, 2 Jul 2022, 19:45 krexos, <[email protected]> wrote:

> I think I understand where Spark saves IO.
>
> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>
> which writes results to disk at the end of each such "arrow",
>
> on the other hand in spark we have
>
> map -> reduce + map -> reduce + map -> reduce ...
>
> which saves about 2 times the IO
>
> thanks everyone,
> krexos
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <[email protected]> wrote:
>
> Hello,
>
> One of the main "selling points" of Spark is that unlike Hadoop map-reduce
> that persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory. I don't understand this as in reality when
> a Spark stage finishes it writes all of the data into shuffle files
> stored on the disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on map-reduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
> thanks!
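
The counting argument in the quoted message can be sketched roughly as below. This is only a toy model, not how either engine is implemented: it assumes one disk write at the end of every stage (each "arrow" plus the final output) and ignores data sizes, spills and caching. The function names are made up for illustration.

```python
def mapreduce_stages(num_jobs):
    """Chained MapReduce jobs: every map and every reduce is its own stage."""
    stages = []
    for _ in range(num_jobs):
        stages += ["map", "reduce"]
    return stages


def spark_stages(num_jobs):
    """Same pipeline in the toy Spark model: each reduce is fused with the
    map that follows it, so only shuffle boundaries separate stages."""
    if num_jobs == 0:
        return []
    return ["map"] + ["reduce+map"] * (num_jobs - 1) + ["reduce"]


def disk_writes(stages):
    """Assumption: exactly one disk write at the end of each stage."""
    return len(stages)


if __name__ == "__main__":
    for n in (1, 3, 10, 100):
        mr = disk_writes(mapreduce_stages(n))
        sp = disk_writes(spark_stages(n))
        print(f"jobs={n}: MR writes={mr}, Spark writes={sp}, ratio={mr / sp:.2f}")
```

Under these assumptions the ratio of writes approaches 2 as the chain gets longer, which matches the "about 2 times the IO" estimate above; for a single map -> reduce there is no difference in this model.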
