First of all, define "far outperforming". For sure, there is no GOD
system that does everything perfectly.
Which use cases are you referring to? It would be interesting for the
community to see some comparisons.
a.
On 5/7/22 12:29, Gourav Sengupta wrote:
Hi,
Spark is just one of the technologies out there now; there are several
other technologies that far outperform Spark or are at least as good.
Regards,
Gourav
On Sat, Jul 2, 2022 at 7:42 PM Sid <flinkbyhe...@gmail.com> wrote:
So, as per the discussion, the output of shuffle stages is also stored
on disk and not in memory?
On Sat, Jul 2, 2022 at 8:44 PM krexos <kre...@protonmail.com> wrote:
thanks a lot!
------- Original Message -------
On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen
<sro...@gmail.com> wrote:
I think that is more accurate, yes. Shuffle files are local, though,
not on distributed storage, which is an advantage.
MR also had map-only transforms and chained mappers, but they were
harder to use. It was not impossible, but you could also say Spark
just made it easier to do the more efficient thing.
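The point about chained mappers can be illustrated with a toy sketch. This is plain Python, not the Hadoop or Spark API, and the helper name `chained` is hypothetical. It only shows the idea of applying several map-only steps record by record inside one task, so nothing is materialized between the steps.

```python
def chained(records, *fns):
    """Apply every function to a record before moving to the next record.

    No intermediate collection is built between the map steps; each
    record flows through the whole chain and is then yielded.
    """
    for record in records:
        for fn in fns:
            record = fn(record)
        yield record


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Two map-only steps executed as a single pass over the input.
result = list(chained([1, 2, 3], add_one, double))
print(result)
```

Roughly speaking, this is the kind of fusion Spark applies by default to consecutive map-like steps within a stage, whereas in MR it had to be set up explicitly.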
On Sat, Jul 2, 2022, 9:34 AM krexos
<kre...@protonmail.com.invalid> wrote:
You said Spark performs IO only when reading data and
writing final data to the disk. I thought by that you
meant that it only reads the input files of the job and
writes the output of the whole job to the disk, but in
reality Spark does store intermediate results on disk,
just in fewer places than MR.
------- Original Message -------
On Saturday, July 2nd, 2022 at 5:27 PM, Sid
<flinkbyhe...@gmail.com> wrote:
I have explained the same thing in very layman's
terms. Go through it once.
On Sat, 2 Jul 2022, 19:45 krexos,
<kre...@protonmail.com.invalid> wrote:
I think I understand where Spark saves IO.
In MR we have
map -> reduce -> map -> reduce -> map -> reduce ...
which writes results to disk at the end of each such "arrow".
On the other hand, in Spark we have
map -> reduce + map -> reduce + map -> reduce ...
which needs only about half the disk IO.
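The "about half" estimate can be checked with a small counting sketch. It is a toy model in plain Python, not Spark or Hadoop code. It assumes that every step writes its output to disk exactly once (shuffle files, HDFS output, or the final result) and that all writes cost the same. The function names are made up for the illustration.

```python
def mr_pipeline(rounds):
    """MR: every map and every reduce is a separate step."""
    return [[op] for _ in range(rounds) for op in ("map", "reduce")]


def spark_pipeline(rounds):
    """Spark: each reduce is fused with the map that follows it."""
    stages = [["map"]]
    stages += [["reduce", "map"] for _ in range(rounds - 1)]
    stages.append(["reduce"])
    return stages


def disk_writes(pipeline):
    # Toy assumption: one disk write per step, all writes equally costly.
    return len(pipeline)


for rounds in (1, 3, 10):
    mr = disk_writes(mr_pipeline(rounds))
    spark = disk_writes(spark_pipeline(rounds))
    print(f"rounds={rounds}: MR writes={mr}, Spark writes={spark}")
```

Under these assumptions, 3 map/reduce rounds give 6 writes for MR against 4 for Spark, and 10 rounds give 20 against 11, so the ratio approaches 2 only for long chains. The model ignores write size and the fact that MR writes reduce output to replicated HDFS while shuffle files stay on local disk.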
thanks everyone,
krexos
------- Original Message -------
On Saturday, July 2nd, 2022 at 1:35 PM, krexos
<kre...@protonmail.com> wrote:
Hello,
One of the main "selling points" of Spark is that
unlike Hadoop map-reduce that persists intermediate
results of its computation to HDFS (disk), Spark
keeps all its results in memory. I don't understand
this, as in reality when a Spark stage finishes it
writes all of the data into shuffle files stored on
the disk
<https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/4-shuffleDetails.md>.
How then is this an improvement on map-reduce?
Image from https://youtu.be/7ooZ4S7Ay6Y
thanks!
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email:papad...@csd.auth.gr
twitter: @papadopoulos_ap
web:http://datalab.csd.auth.gr/~apostol