Re: How is Spark a memory based solution if it writes data to disk before shuffles?

krexos Sat, 02 Jul 2022 07:34:56 -0700

You said Spark performs IO only when reading data and writing final data to the 
disk. I thought by that you meant that it only reads the input files of the job 
and writes the output of the whole job to the disk, but in reality Spark does 
store intermediate results on disk, just in fewer places than MR.
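
To make the "fewer places" point concrete, here is a rough sketch that only counts the points where data is written out in a chain of map -> reduce rounds. It is a simplified model, not a measurement: the function name `disk_writes` and its parameters are made up for illustration, every write is counted as one unit regardless of size, and spills, caching and HDFS replication are ignored.

```python
def disk_writes(rounds: int, fuse_reduce_with_next_map: bool) -> int:
    """Count write points for a chain of `rounds` map -> reduce pairs.

    Hypothetical model: each write counts as 1, whatever its size.
    """
    # Every map -> reduce arrow is a shuffle, written to disk in both systems.
    shuffle_writes = rounds
    if fuse_reduce_with_next_map:
        # Spark-style: a reduce and the following map run in the same stage,
        # so nothing is written between them.
        between_rounds = 0
    else:
        # MR-style: each reduce output is written out before the next map reads it.
        between_rounds = rounds - 1
    final_output = 1  # the job's result is written once at the end
    return shuffle_writes + between_rounds + final_output


for n in (1, 3, 10):
    mr = disk_writes(n, fuse_reduce_with_next_map=False)
    spark = disk_writes(n, fuse_reduce_with_next_map=True)
    print(f"rounds={n}: MR={mr} writes, Spark={spark} writes, ratio={mr / spark:.2f}")
```

Under these assumptions the ratio is 2n / (n + 1) for n rounds, so it approaches 2 only for long chains, and a single map -> reduce job sees no saving at all.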


------- Original Message -------
On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com> wrote:

> I have explained the same thing in very layman's terms. Go through it once.
>
> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>
>> I think I understand where Spark saves IO.
>>
>> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>
>> which writes results to disk at the end of each such "arrow",
>>
>> on the other hand, in Spark we have
>>
>> map -> reduce + map -> reduce + map -> reduce ...
>>
>> which cuts the disk IO roughly in half
>>
>> thanks everyone,
>> krexos
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com> wrote:
>>
>>> Hello,
>>>
>>> One of the main "selling points" of Spark is that, unlike Hadoop MapReduce, 
>>> which persists intermediate results of its computation to HDFS (disk), Spark 
>>> keeps all its results in memory. I don't understand this, as in reality when 
>>> a Spark stage finishes [it writes all of the data into shuffle files stored 
>>> on the disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>>> How then is this an improvement on MapReduce?
>>>
>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>
>>> thanks!
