This doesn't add up with what's described in the internals page I linked. 
What you are talking about is shuffle spills, which happen while a stage is 
running, when execution memory fills up. What I am talking about is that at 
the end of each stage Spark writes all of the stage's results to shuffle 
files on disk, so we end up with as many rounds of disk writes as there are 
stages.
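
Here is a minimal sketch of how to observe this (Scala, Spark in local mode; 
the listener and the little job are illustrative, not taken from the internals 
page): it prints, for every completed stage, the shuffle-write bytes next to 
the disk-spill bytes, so the two effects can be told apart.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.sql.SparkSession

object ShuffleWritesPerStage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-writes-per-stage")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Print each finished stage's shuffle-write and spill totals.
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(ev: SparkListenerStageCompleted): Unit = {
        val m = ev.stageInfo.taskMetrics
        println(s"stage ${ev.stageInfo.stageId}: " +
          s"shuffle write = ${m.shuffleWriteMetrics.bytesWritten} B, " +
          s"disk spill = ${m.diskBytesSpilled} B")
      }
    })

    // Two wide transformations => two stage boundaries => two shuffle writes.
    sc.parallelize(1 to 100000)
      .map(i => (i % 100, 1))
      .reduceByKey(_ + _)   // boundary 1: this stage's output goes to shuffle files
      .repartition(4)       // boundary 2: another round of shuffle files
      .count()

    spark.stop()
  }
}
```

With enough execution memory the spill column should stay at 0, while the 
first two stages still report non-zero shuffle write: that is the 
end-of-stage disk write I mean.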

thanks,
krexos

------- Original Message -------
On Saturday, July 2nd, 2022 at 3:34 PM, Sid <flinkbyhe...@gmail.com> wrote:

> Hi Krexos,
>
> If I understand correctly, you are asking how Spark is an advantage over 
> MapReduce if it also involves disk I/O.
>
> Basically, MapReduce writes every intermediate result to disk, so on 
> average it involves roughly 6 rounds of disk I/O, whereas Spark (assuming 
> it has enough memory to store intermediate results) involves about 3 times 
> less disk I/O, i.e. only while reading the input data and writing the final 
> data to disk.
>
> Thanks,
> Sid
>
> On Sat, 2 Jul 2022, 17:58 krexos, <kre...@protonmail.com.invalid> wrote:
>
>> Hello,
>>
>> One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, 
>> which persists the intermediate results of its computation to HDFS (disk), 
>> Spark keeps all of its results in memory. I don't understand this, since in 
>> reality, when a Spark stage finishes, [it writes all of the data into 
>> shuffle files stored on the 
>> disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>>  How then is this an improvement on MapReduce?
>>
>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>
>> thanks!
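
To put Sid's count in context, here is a sketch (Scala; the paths and the 
pipeline are hypothetical) of the kind of two-shuffle job we are both 
describing, annotated with where each framework writes. Expressed as 
MapReduce it would be two chained jobs, and each job boundary is a 
replicated HDFS write followed by a re-read; in Spark the same boundaries 
are local shuffle files, so HDFS is only touched when reading the input and 
writing the final result.

```scala
// Hypothetical word-count-then-sort pipeline: two wide operations,
// each introducing a shuffle (or, in MapReduce, a separate job).
import org.apache.spark.sql.SparkSession

object IoComparison {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder()
      .appName("io-comparison")
      .master("local[*]")
      .getOrCreate()
      .sparkContext

    sc.textFile("hdfs:///tmp/input")          // HDFS read (both frameworks)
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)                     // Spark: local shuffle files;
                                              // MapReduce: job 1 ends with a
                                              // replicated HDFS write + re-read
      .map(_.swap)
      .sortByKey(ascending = false)           // Spark: local shuffle files;
                                              // MapReduce: job 2 ends the same way
      .saveAsTextFile("hdfs:///tmp/top-words") // HDFS write (both frameworks)
  }
}
```

The shuffle files themselves are served by the executors from local disk and 
are not replicated, which is the main difference from an intermediate HDFS 
write.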
