I think that is more accurate, yes. Also, shuffle files are written to local
disk rather than to distributed storage, which is a further advantage. MR also
had map-only transforms and chained mappers, but they were harder to use. Not
impossible, but you could say Spark just made it easier to do the more
efficient thing.
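To make that concrete, here is a minimal word-count sketch (the local master
and the /tmp paths are my own assumptions for illustration, not something from
this thread) showing where the disk IO actually happens in Spark:

import org.apache.spark.sql.SparkSession

object PipeliningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipelining-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stage 1: these narrow transformations are pipelined in memory,
    // record by record; nothing is written out between them.
    val pairs = sc.textFile("/tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))

    // Stage boundary: reduceByKey needs a shuffle, so the map-side output
    // is written to local shuffle files (local disk, not HDFS).
    val counts = pairs.reduceByKey(_ + _)

    // Only the final result goes back to distributed storage.
    counts.saveAsTextFile("/tmp/output")

    spark.stop()
  }
}

A single MR job would write its map output and final output much the same way;
the difference shows up once several of these phases are chained together.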

On Sat, Jul 2, 2022, 9:34 AM krexos <kre...@protonmail.com.invalid> wrote:

>
> You said Spark performs IO only when reading data and writing final data
> to disk. I thought by that you meant that it only reads the input files of
> the job and writes the output of the whole job to disk, but in reality
> Spark does store intermediate results on disk, just in fewer places than MR.
>
> ------- Original Message -------
> On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com>
> wrote:
>
> I have explained the same thing in layman's terms. Please go through it
> once.
>
> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>
>>
>> I think I understand where Spark saves IO.
>>
>> In MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>
>> which writes results to disk at the end of each such "arrow".
>>
>> On the other hand, in Spark we have
>>
>> map -> reduce + map -> reduce + map -> reduce ...
>>
>> which roughly halves the disk IO.
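>>
>> Concretely (the paths and the local Spark setup below are only
>> illustrative, not from anything discussed here), chaining two
>> aggregations looks like this in Spark, and the word counts in the middle
>> never touch HDFS:
>>
>> import org.apache.spark.sql.SparkSession
>>
>> object TwoPhaseSketch {
>>   def main(args: Array[String]): Unit = {
>>     val sc = SparkSession.builder()
>>       .appName("two-phase-sketch")
>>       .master("local[*]")
>>       .getOrCreate()
>>       .sparkContext
>>
>>     // Phase 1 (one MR job): word count. The shuffle for reduceByKey
>>     // only goes to local shuffle files.
>>     val wordCounts = sc.textFile("/tmp/input.txt")
>>       .flatMap(_.split("\\s+"))
>>       .map(w => (w, 1))
>>       .reduceByKey(_ + _)
>>
>>     // Phase 2 (a second MR job would have to re-read phase 1's HDFS
>>     // output): here the "map" of phase 2 runs in the same stage as the
>>     // "reduce" of phase 1, so nothing is written to distributed storage
>>     // between them.
>>     val countOfCounts = wordCounts
>>       .map { case (_, n) => (n, 1) }
>>       .reduceByKey(_ + _)
>>
>>     countOfCounts.saveAsTextFile("/tmp/output")
>>   }
>> }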
>>
>> thanks everyone,
>> krexos
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com>
>> wrote:
>>
>> Hello,
>>
>> One of the main "selling points" of Spark is that, unlike Hadoop
>> map-reduce, which persists the intermediate results of its computation to
>> HDFS (disk), Spark keeps all of its results in memory. I don't understand
>> this, because in reality, when a Spark stage finishes, it writes all of
>> its data into shuffle files stored on disk
>> <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
>> How then is this an improvement over map-reduce?
>>
>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>
>>
>> thanks!
>>
>>
>>
>
