Re: How is Spark a memory based solution if it writes data to disk before shuffles?

Sid Sat, 02 Jul 2022 07:27:24 -0700

I have explained the same thing in very simple, layman's terms. Please go
through it once.


On Sat, 2 Jul 2022, 19:45 krexos, <[email protected]> wrote:

>
> I think I understand where Spark saves IO.
>
> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>
> which writes results to disk at the end of each such "arrow",
>
> on the other hand, in Spark we have
>
> map -> reduce + map -> reduce + map -> reduce ...
>
> which cuts the disk IO roughly in half
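
The arrow counting above can be written down as a rough sketch. This is only a
simplified model, not a measurement: it assumes every disk write is about the
same size and ignores HDFS replication, spills, and caching. The function names
are made up for illustration. In MapReduce each round writes the map output
(for the shuffle) and the reduce output (to HDFS). In Spark each round writes
shuffle files once, and a reduce is pipelined with the following map in the
same stage, so only the last reduce writes its output.

```python
def mapreduce_disk_writes(rounds: int) -> int:
    """Hypothetical model: each round writes map output and reduce output."""
    return 2 * rounds


def spark_disk_writes(rounds: int) -> int:
    """Hypothetical model: one shuffle write per round, plus the final output."""
    return rounds + 1


if __name__ == "__main__":
    for rounds in (1, 3, 10):
        mr = mapreduce_disk_writes(rounds)
        sp = spark_disk_writes(rounds)
        # The ratio approaches 2 as the chain of rounds gets longer.
        print(f"rounds={rounds}: MapReduce={mr}, Spark={sp}, ratio={mr / sp:.2f}")
```

Under these assumptions the saving is close to a factor of 2 only for long
chains; for a single map -> reduce job the two models write the same number of
times.
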
>
> thanks everyone,
> krexos
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <[email protected]>
> wrote:
>
> Hello,
>
> One of the main "selling points" of Spark is that, unlike Hadoop MapReduce,
> which persists intermediate results of its computation to HDFS (disk), Spark
> keeps all its results in memory. I don't understand this, since in reality,
> when a Spark stage finishes, it writes all of its data into shuffle files
> stored on disk
> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
> How then is this an improvement on MapReduce?
>
> Image from https://youtu.be/7ooZ4S7Ay6Y
>
>
> thanks!
>
>
>
