Re: How is Spark a memory based solution if it writes data to disk before shuffles?

krexos Sat, 02 Jul 2022 08:15:16 -0700

thanks a lot!

------- Original Message -------
On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen <sro...@gmail.com> wrote:


> I think that is more accurate, yes. Shuffle files are also local, not on 
> distributed storage, which is an advantage. MR also had map-only 
> transforms and chained mappers, but they were harder to use. It was not 
> impossible, but you could also say Spark just made it easier to do the more 
> efficient thing.
>
> On Sat, Jul 2, 2022, 9:34 AM krexos <kre...@protonmail.com.invalid> wrote:
>
>> You said Spark performs IO only when reading data and writing final data to 
>> the disk. I thought by that you meant that it only reads the input files of 
>> the job and writes the output of the whole job to the disk, but in reality 
>> Spark does store intermediate results on disk, just in fewer places than MR.
>>
>> ------- Original Message -------
>> On Saturday, July 2nd, 2022 at 5:27 PM, Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> I have explained the same thing in layman's terms. Go through it 
>>> once.
>>>
>>> On Sat, 2 Jul 2022, 19:45 krexos, <kre...@protonmail.com.invalid> wrote:
>>>
>>>> I think I understand where Spark saves IO.
>>>>
>>>> in MR we have map -> reduce -> map -> reduce -> map -> reduce ...
>>>>
>>>> which writes results to disk at the end of each such "arrow",
>>>>
>>>> on the other hand, in Spark we have
>>>>
>>>> map -> reduce + map -> reduce + map -> reduce ...
>>>>
>>>> which cuts the intermediate disk IO roughly in half
>>>>
>>>> thanks everyone,
>>>> krexos
>>>>
>>>> ------- Original Message -------
>>>> On Saturday, July 2nd, 2022 at 1:35 PM, krexos <kre...@protonmail.com> 
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> One of the main "selling points" of Spark is that, unlike Hadoop 
>>>>> map-reduce, which persists intermediate results of its computation to HDFS 
>>>>> (disk), Spark keeps all its results in memory. I don't understand this, as 
>>>>> in reality when a Spark stage finishes [it writes all of the data into 
>>>>> shuffle files stored on the 
>>>>> disk](https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md).
>>>>> How then is this an improvement on map-reduce?
>>>>>
>>>>> Image from https://youtu.be/7ooZ4S7Ay6Y
>>>>>
>>>>> thanks!
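
The arrow-counting argument in the thread can be made concrete with a small sketch. This is a toy model in plain Python, not Spark or Hadoop code. It counts only the intermediate disk writes (the "arrows"), assuming one write per shuffle and, for MR, one additional HDFS write between consecutive jobs. It ignores data sizes, spills, and the final output write, and the function names are hypothetical.

```python
def mr_intermediate_writes(rounds: int) -> int:
    """Arrows in: map -> reduce -> map -> reduce -> ...

    Assumption: each map/reduce pair is a separate MR job, so there is
    one disk write per shuffle plus one per boundary between jobs.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    shuffles = rounds            # map -> reduce inside each job
    job_boundaries = rounds - 1  # reduce -> next job's map
    return shuffles + job_boundaries


def spark_intermediate_writes(rounds: int) -> int:
    """Arrows in: map -> reduce + map -> reduce + map -> ...

    Assumption: the reduce side of one shuffle and the following map run
    in the same stage, so only the shuffles write to disk.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    return rounds


if __name__ == "__main__":
    for rounds in (1, 3, 10):
        mr = mr_intermediate_writes(rounds)
        spark = spark_intermediate_writes(rounds)
        print(f"{rounds} rounds: MR={mr}, Spark={spark}, ratio={mr / spark:.2f}")
```

For 3 rounds the model gives 5 writes for MR against 3 for Spark, and the ratio approaches 2 as the chain grows, which matches the "about 2 times" estimate. It does not capture the separate point that Spark's shuffle files are local while MR's output between jobs goes to distributed storage.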
