Re: How is Spark a memory based solution if it writes data to disk before shuffles?

Apostolos N. Papadopoulos Tue, 05 Jul 2022 02:39:02 -0700

First of all, define "far outperforming". For sure, there is no GOD system that does everything perfectly.

Which use cases are you referring to? It would be interesting for the community to see some comparisons.


a.


On 5/7/22 12:29, Gourav Sengupta wrote:

Hi,

Spark is just one of the technologies out there now; there are several other technologies far outperforming Spark, or at least as good as Spark.




Regards,
Gourav

On Sat, Jul 2, 2022 at 7:42 PM Sid <flinkbyhe...@gmail.com> wrote:

    So as per the discussion, shuffle stages output is also stored on
    disk and not in memory?

    On Sat, Jul 2, 2022 at 8:44 PM krexos <kre...@protonmail.com> wrote:


        thanks a lot!

        ------- Original Message -------
        On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen
        <sro...@gmail.com> wrote:

        I think that is more accurate, yes. Shuffle files are also
        local rather than on distributed storage, which is an
        advantage. MR also had map-only transforms and chained
        mappers, but they were harder to use. Not impossible, but you
        could also say Spark just made it easier to do the more
        efficient thing.

        On Sat, Jul 2, 2022, 9:34 AM krexos
        <kre...@protonmail.com.invalid> wrote:


            You said Spark performs IO only when reading data and
            writing final data to the disk. I thought by that you
            meant that it only reads the input files of the job and
            writes the output of the whole job to the disk, but in
            reality Spark does store intermediate results on disk,
            just in fewer places than MR.

            ------- Original Message -------
            On Saturday, July 2nd, 2022 at 5:27 PM, Sid
            <flinkbyhe...@gmail.com> wrote:

            I have explained the same thing in very layman's
            terms. Go through it once.

            On Sat, 2 Jul 2022, 19:45 krexos,
            <kre...@protonmail.com.invalid> wrote:


                I think I understand where Spark saves IO.

                in MR we have map -> reduce -> map -> reduce -> map
                -> reduce ...

                which writes results to disk at the end of each such
                "arrow",

                on the other hand in Spark we have

                map -> reduce + map -> reduce + map -> reduce ...

                which saves about 2 times the IO

                thanks everyone,
                krexos
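
The arrow counting in the message above can be sketched in a few lines of Python. This is a toy model of that reasoning only, not a description of Spark's scheduler: the function name `count_intermediate_writes` and the stage labels are made up for illustration, and the model assumes every "arrow" costs exactly one intermediate write, ignoring the final job output, spills, and caching.

```python
def count_intermediate_writes(rounds: int) -> dict:
    """Count the "arrows" (intermediate disk writes) in a chain of
    map/reduce rounds, following the simplified model from the thread.

    Hypothetical helper for illustration; not part of any Spark or
    Hadoop API.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    # MapReduce: map -> reduce -> map -> reduce -> ...
    # Every step is separate, so every arrow between steps is a write.
    mr_steps = ["map", "reduce"] * rounds
    mr_writes = len(mr_steps) - 1

    # Spark (as modelled in the thread): map -> reduce+map -> ... -> reduce
    # A reduce and the map that follows it run in the same stage, so only
    # the shuffle boundaries between stages are counted as writes.
    spark_stages = ["map"] + ["reduce+map"] * (rounds - 1) + ["reduce"]
    spark_writes = len(spark_stages) - 1

    return {"mapreduce": mr_writes, "spark": spark_writes}


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, count_intermediate_writes(n))
```

Under this model a chain of n rounds has 2n - 1 arrows in MR and n arrows in Spark, so the ratio approaches 2 only for long chains; for the three-round example in the message it is 5 to 3.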

                ------- Original Message -------
                On Saturday, July 2nd, 2022 at 1:35 PM, krexos
                <kre...@protonmail.com> wrote:

                Hello,

                One of the main "selling points" of Spark is that
                unlike Hadoop map-reduce that persists intermediate
                results of its computation to HDFS (disk), Spark
                keeps all its results in memory. I don't understand
                this, as in reality when a Spark stage finishes it
                writes all of the data into shuffle files stored on
                the disk
                <https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
                How then is this an improvement on map-reduce?

                Image from https://youtu.be/7ooZ4S7Ay6Y


                thanks!

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email:papad...@csd.auth.gr
twitter: @papadopoulos_ap
web:http://datalab.csd.auth.gr/~apostol
