You're right. I suppose I just mean that most operations don't need a shuffle: you don't have 10 stages for 10 transformations, since narrow transformations are pipelined within a single stage. Also, caching datasets in memory is another way Spark uses memory to avoid disk I/O.
On Sat, Jul 2, 2022, 8:42 AM krexos <kre...@protonmail.com.invalid> wrote:

> This doesn't add up with what's described in the internals page I
> included. What you are talking about is shuffle spills at the beginning of
> the stage. What I am talking about is that at the end of the stage Spark
> writes all of the stage's results to shuffle files on disk, so we will
> have the same number of I/O writes as there are stages.
>
> thanks,
> krexos
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 3:34 PM, Sid <flinkbyhe...@gmail.com>
> wrote:
>
> Hi Krexos,
>
> If I understand correctly, you are asking how Spark is an advantage over
> MapReduce when Spark also involves disk I/O.
>
> Basically, the MapReduce phases write every intermediate result to disk,
> so on average a job involves about six disk I/O operations, whereas Spark
> (assuming it has enough memory to store intermediate results) involves
> roughly three times less disk I/O, i.e. only while reading the input data
> and writing the final data to disk.
>
> Thanks,
> Sid
>
> On Sat, 2 Jul 2022, 17:58 krexos, <kre...@protonmail.com.invalid> wrote:
>
>> Hello,
>>
>> One of the main "selling points" of Spark is that, unlike Hadoop
>> MapReduce, which persists the intermediate results of its computation to
>> HDFS (disk), Spark keeps all its results in memory. I don't understand
>> this, as in reality when a Spark stage finishes it writes all of the data
>> into shuffle files stored on disk
>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>> How then is this an improvement on MapReduce?
>>
>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>
>> thanks!