So, as per the discussion, a shuffle stage's output is also stored on disk and
not in memory?
On Sat, Jul 2, 2022 at 8:44 PM krexos wrote:
thanks a lot!
--- Original Message ---
On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen wrote:
I think that is more accurate, yes. Though shuffle files are local, not on
distributed storage, which is an advantage. MR also had map-only
transforms and chained mappers, but they were harder to use. Not impossible, but you
could also say Spark just made it easier to do the more efficient thing.
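The point about shuffle files being local rather than on distributed storage can be made concrete with a back-of-the-envelope sketch (my own illustration, not from the thread; the replication factor of 3 is HDFS's default):

```python
# Persisting intermediate data to HDFS (as chained MapReduce jobs do
# between jobs) pays the replication cost; a Spark shuffle file is a
# single local-disk copy with no replication on the write path.

HDFS_REPLICATION = 3  # HDFS default replication factor

def hdfs_write_bytes(data_bytes, replication=HDFS_REPLICATION):
    """Bytes physically written when persisting to HDFS: one copy per
    replica (and all but one copy also crosses the network)."""
    return data_bytes * replication

def local_shuffle_write_bytes(data_bytes):
    """Bytes written for a local shuffle file: one local copy."""
    return data_bytes

intermediate = 10 * 1024**3  # say, 10 GiB of intermediate data
print(hdfs_write_bytes(intermediate) // local_shuffle_write_bytes(intermediate))
# prints 3 -- replication alone triples the write traffic on the HDFS route
```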
You said Spark performs IO only when reading data and writing final data to the
disk. I thought by that you meant that it only reads the input files of the job
and writes the output of the whole job to the disk, but in reality Spark does
store intermediate results on disk, just in fewer places.
I have explained the same thing in layman's terms. Go through it
once.
On Sat, 2 Jul 2022, 19:45 krexos, wrote:
I think I understand where Spark saves IO.
In MR we have map -> reduce -> map -> reduce -> map -> reduce ...
which writes results to disk at the end of each such "arrow",
whereas in Spark we have
map -> reduce + map -> reduce + map -> reduce ...
which saves about 2 times the IO.
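The "about 2 times" figure can be tallied with a toy count (my own sketch; the per-job accounting is a simplification, not taken from the thread):

```python
# Toy disk-I/O tally for a pipeline with several shuffle boundaries.
# Per MR job: read input from HDFS, write map output to local disk,
# read it back for the reduce, write reduce output to HDFS.
# Spark: read the input once, write + read one local shuffle file per
# stage boundary, write the final result once.

def mr_disk_ops(num_jobs):
    return num_jobs * 4

def spark_disk_ops(num_shuffles):
    return 1 + 2 * num_shuffles + 1

# a chain of 3 MR jobs vs one Spark job with 2 shuffles
print(mr_disk_ops(3), spark_disk_ops(2))  # 12 vs 6: about half the IO
```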
Isn't Spark the same in this regard? You can execute all of the narrow
dependencies of a Spark stage in one mapper, thus having the same number of
mappers + reducers as Spark stages for the same job, no?
thanks,
krexos
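The stage/mapper correspondence being discussed can be pictured with a toy sketch (my own illustration; the operation names are just examples): Spark pipelines narrow transformations into a single stage, and only wide (shuffle) transformations add a boundary.

```python
# Narrow ops are fused into the current stage; each wide op forces a
# shuffle boundary, i.e. one set of shuffle files written to disk.

NARROW = {"map", "filter", "flatMap", "mapPartitions"}
WIDE = {"reduceByKey", "groupByKey", "join", "repartition"}

def count_stages(ops):
    """Stages = shuffle boundaries + 1; narrow ops add no stages."""
    assert all(op in NARROW | WIDE for op in ops)
    return sum(1 for op in ops if op in WIDE) + 1

job = ["map", "filter", "reduceByKey", "map", "flatMap", "groupByKey", "map"]
print(count_stages(job))  # 3 stages for 7 transformations
```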
--- Original Message ---
On Saturday, July 2nd, 2022 at 4:45 PM,
You're right. I suppose I just mean most operations don't need a shuffle -
you don't have 10 stages for 10 transformations. Also, caching is
another way that memory is used to avoid IO.
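The caching point can be sketched like this (a pure-Python toy of my own, not Spark's API): without caching, every action re-runs the lineage back to the source; with an in-memory cache, the source is read once.

```python
reads = {"count": 0}

def read_source():
    """Stands in for reading input from disk/HDFS."""
    reads["count"] += 1
    return list(range(5))

class Dataset:
    def __init__(self, compute):
        self._compute = compute
        self._cached = None
    def cache(self):
        self._cached = self._compute()  # eager for simplicity; Spark's is lazy
        return self
    def collect(self):
        return list(self._cached) if self._cached is not None else self._compute()

ds = Dataset(read_source)
ds.collect(); ds.collect()
print(reads["count"])  # 2 -- every action re-reads the source

reads["count"] = 0
cached = Dataset(read_source).cache()
cached.collect(); cached.collect()
print(reads["count"])  # 1 -- the cache absorbs repeated actions
```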
Don't stages by definition include a shuffle? If you didn't need a shuffle
between 2 stages you could merge them into one stage.
thanks,
krexos
This doesn't add up with what's described in the internals page I included.
What you are talking about is shuffle spills at the beginning of the stage.
What I am talking about is that at the end of the stage Spark writes all of the
stage's results to shuffle files on disk, thus we will have the
Because only shuffle stages write shuffle files. Most stages are not
shuffles.
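For what "shuffle files" means at the end of a shuffle stage, here is a tiny sketch (my own illustration, not Spark's actual implementation): each map task partitions its records by key into one bucket per reducer and writes those buckets to local disk for reducers to fetch.

```python
def shuffle_write(records, num_reducers):
    """Hash-partition (key, value) records into one bucket per reducer.
    In a real system each bucket would be written as a local file
    segment that the corresponding reducer later fetches."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

buckets = shuffle_write([("a", 1), ("b", 2), ("a", 3)], num_reducers=4)
# records sharing a key always land in the same bucket, so one reducer
# sees all values for that key
```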
Hi Krexos,
If I understand correctly, you are asking: if even Spark involves
disk I/O, then how is it an advantage over MapReduce?
Basically, each MapReduce phase writes its intermediate results to the disk.
So on average it involves 6 times the disk I/O, whereas Spark (assuming it has
an
Hello,
One of the main "selling points" of Spark is that unlike Hadoop MapReduce, which
persists intermediate results of its computation to HDFS (disk), Spark keeps
all its results in memory. I don't understand this, as in reality when a Spark
stage finishes, it writes all of the data into