Re: How is Spark a memory based solution if it writes data to disk before shuffles?

Sid Sat, 02 Jul 2022 11:41:51 -0700

So, as per the discussion, the output of shuffle stages is also stored on
disk and not in memory?


On Sat, Jul 2, 2022 at 8:44 PM krexos <kre...@protonmail.com> wrote:

>
> thanks a lot!
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen <sro...@gmail.com>
> wrote:
>
> I think that is more accurate, yes. Shuffle files are also local rather
> than on distributed storage, which is a further advantage. MR also had
> map-only transforms and chained mappers, but they were harder to use. So it
> was not impossible in MR, but you could also say Spark just made it easier
> to do the more efficient thing.
>
> On Sat, Jul 2, 2022, 9:34 AM krexos <kre...@protonmail.com.invalid> wrote:
>
>>
>> You said Spark performs IO only when reading data and writing final data
>> to the disk. I thought by that you meant that it only reads the input files
>> of the job and writes the output of the whole job to the disk, but in
>> reality Spark does store intermediate results on disk, just in fewer places
>> than MR.
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com>
>> wrote:
>>
>> I have explained the same thing in layman's terms. Please go through it
>> once.
>>
>> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>>
>>>
>>> I think I understand where Spark saves IO.
>>>
>>> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>>
>>> which writes results to disk at the end of each such "arrow",
>>>
>>> on the other hand in Spark we have
>>>
>>> map -> reduce + map -> reduce + map -> reduce ...
>>>
>>> which cuts the intermediate disk IO by a factor of about 2
>>>
>>> thanks everyone,
>>> krexos
>>>
>>> ------- Original Message -------
>>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com>
>>> wrote:
>>>
>>> Hello,
>>>
>>> One of the main "selling points" of Spark is that unlike Hadoop
>>> map-reduce that persists intermediate results of its computation to HDFS
>>> (disk), Spark keeps all its results in memory. I don't understand this as
>>> in reality when a Spark stage finishes it writes all of the data into
>>> shuffle files stored on the disk
>>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>>> How then is this an improvement on map-reduce?
>>>
>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>
>>>
>>> thanks!
>>>
>>>
>>>
>>
>
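
The arrow-counting argument quoted above (map -> reduce -> map -> reduce ...
versus map -> reduce + map -> reduce ...) can be sketched as a tiny counting
model. This is only an illustration of that argument under simplifying
assumptions, not a measurement of either system: the function name and its
parameters are hypothetical, every intermediate write is treated as equally
expensive, and the initial input read and the final output write are ignored.

```python
def intermediate_disk_writes(num_rounds: int, fuse_reduce_with_next_map: bool) -> int:
    """Count intermediate disk writes in a chain of map -> reduce rounds.

    Hypothetical model of the thread's argument:
    - every map -> reduce arrow is a shuffle, which writes to disk in both
      MR and Spark;
    - every reduce -> map arrow between rounds writes to disk only when the
      reduce is NOT fused with the following map (the MR-style chain).
    """
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")

    shuffle_writes = num_rounds  # one per map -> reduce arrow
    between_round_writes = 0 if fuse_reduce_with_next_map else num_rounds - 1
    return shuffle_writes + between_round_writes


if __name__ == "__main__":
    for rounds in (1, 3, 10, 100):
        mr_style = intermediate_disk_writes(rounds, fuse_reduce_with_next_map=False)
        spark_style = intermediate_disk_writes(rounds, fuse_reduce_with_next_map=True)
        print(rounds, mr_style, spark_style, mr_style / spark_style)
```

In this model a chain of n rounds costs 2n - 1 writes in the MR-style layout
and n writes in the fused layout, so the ratio approaches 2 as the chain grows,
which matches the "about 2 times" estimate in the quoted message.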
