So, as per the discussion, the output of shuffle stages is also stored on disk and not in memory?
On Sat, Jul 2, 2022 at 8:44 PM krexos <kre...@protonmail.com> wrote:
>
> thanks a lot!
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen <sro...@gmail.com> wrote:
>
> I think that is more accurate yes. Though, shuffle files are local, not on
> distributed storage too, which is an advantage. MR also had map only
> transforms and chained mappers, but harder to use. Not impossible but you
> could also say Spark just made it easier to do the more efficient thing.
>
> On Sat, Jul 2, 2022, 9:34 AM krexos <kre...@protonmail.com.invalid> wrote:
>
>> You said Spark performs IO only when reading data and writing final data
>> to the disk. I thought by that you meant that it only reads the input files
>> of the job and writes the output of the whole job to the disk, but in
>> reality Spark does store intermediate results on disk, just in fewer places
>> than MR
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>
>> I have explained the same thing in a very layman's terms. Go through it
>> once.
>>
>> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>>
>>> I think I understand where Spark saves IO.
>>>
>>> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>>
>>> which writes results to disk at the end of each such "arrow",
>>>
>>> on the other hand in Spark we have
>>>
>>> map -> reduce + map -> reduce + map -> reduce ...
>>>
>>> which saves about 2 times the IO
>>>
>>> thanks everyone,
>>> krexos
>>>
>>> ------- Original Message -------
>>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com> wrote:
>>>
>>> Hello,
>>>
>>> One of the main "selling points" of Spark is that unlike Hadoop
>>> map-reduce that persists intermediate results of its computation to HDFS
>>> (disk), Spark keeps all its results in memory. I don't understand this as
>>> in reality when a Spark stage finishes it writes all of the data into
>>> shuffle files stored on the disk
>>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>>> How then is this an improvement on map-reduce?
>>>
>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>
>>> thanks!
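
For what it's worth, the "arrow" counting argument quoted above can be sketched as a small Python model. This is only a back-of-the-envelope illustration, not a measurement: the function names are made up for this example, and it assumes every write point moves a similar volume of data, ignoring spills, caching, and the point made in the thread that Spark shuffle files are local while MR job output goes to distributed storage.

```python
def mr_write_points(rounds: int) -> int:
    """Disk write points for `rounds` chained MapReduce jobs.

    Simplified model from the thread: every "arrow" in
    map -> reduce -> map -> reduce ... ends with a write, so each job
    writes its map output (shuffle) and then its reduce output.
    """
    return 2 * rounds


def spark_write_points(rounds: int) -> int:
    """Disk write points for the same pipeline as Spark stages.

    Simplified model from the thread: map -> reduce + map -> reduce ...
    has one shuffle write per round, plus the final output of the job.
    """
    return rounds + 1


# Hypothetical pipeline lengths, chosen only to show the trend.
for rounds in (1, 3, 10):
    mr = mr_write_points(rounds)
    spark = spark_write_points(rounds)
    print(f"rounds={rounds}: MR={mr}, Spark={spark}, ratio={mr / spark:.2f}")
```

Under these assumptions the ratio only approaches 2 for long pipelines; a single map -> reduce round comes out the same in both models, which matches the point that Spark still writes intermediate results to disk, just in fewer places.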