You're right. I suppose I just mean that most operations don't need a shuffle: you don't have 10 stages for 10 transformations, since narrow transformations are pipelined within a single stage. Also, caching datasets in memory is another way Spark uses memory to avoid disk I/O.
On Sat, Jul 2, 2022, 8:42 AM krexos <kre...@protonmail.com.invalid> wrote:

> This doesn't add up with what's described in the internals page I
> included. What you are talking about is shuffle spills at the beginning of
> the stage. What I am talking about is that at the end of the stage Spark
> writes all of the stage's results to shuffle files on disk, so we will
> have the same number of I/O writes as there are stages.
>
> thanks,
> krexos
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 3:34 PM, Sid <flinkbyhe...@gmail.com>
> wrote:
>
> Hi Krexos,
>
> If I understand correctly, you are asking how Spark is an advantage over
> MapReduce when Spark also involves disk I/O.
>
> Basically, the MapReduce phases write every intermediate result to disk,
> so on average a job involves about six disk I/O operations, whereas Spark
> (assuming it has enough memory to store intermediate results) involves
> roughly three times less disk I/O, i.e. only while reading the input data
> and writing the final data to disk.
>
> Thanks,
> Sid
>
> On Sat, 2 Jul 2022, 17:58 krexos, <kre...@protonmail.com.invalid> wrote:
>
>> Hello,
>>
>> One of the main "selling points" of Spark is that, unlike Hadoop
>> MapReduce, which persists the intermediate results of its computation to
>> HDFS (disk), Spark keeps all its results in memory. I don't understand
>> this, as in reality when a Spark stage finishes it writes all of the data
>> into shuffle files stored on disk
>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>> How then is this an improvement on MapReduce?
>>
>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>
>> thanks!