Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
So as per the discussion, the output of shuffle stages is also stored on disk and not in memory? On Sat, Jul 2, 2022 at 8:44 PM krexos wrote: > > thanks a lot! > > --- Original Message --- > On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen > wrote: > > I think that is more accurate yes.

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
thanks a lot! --- Original Message --- On Saturday, July 2nd, 2022 at 6:07 PM, Sean Owen wrote: > I think that is more accurate yes. Though, shuffle files are local, not on > distributed storage too, which is an advantage. MR also had map only > transforms and chained mappers, but

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
I think that is more accurate, yes. Though shuffle files are local, not on distributed storage, which is an advantage. MR also had map-only transforms and chained mappers, but they were harder to use. Not impossible, but you could also say Spark just made it easier to do the more efficient thing. On

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
You said Spark performs IO only when reading data and writing final data to the disk. I thought by that you meant that it only reads the input files of the job and writes the output of the whole job to the disk, but in reality Spark does store intermediate results on disk, just in fewer places

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
I have explained the same thing in layman's terms. Go through it once. On Sat, 2 Jul 2022, 19:45 krexos, wrote: > > I think I understand where Spark saves IO. > > in MR we have map -> reduce -> map -> reduce -> map -> reduce ... > > which writes results to disk at the end of each such

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
I think I understand where Spark saves IO. In MR we have map -> reduce -> map -> reduce -> map -> reduce ... which writes results to disk at the end of each such "arrow". On the other hand, in Spark we have map -> reduce + map -> reduce + map -> reduce ... which saves about 2 times the IO
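
The arrow counting in this message can be checked with a small sketch. It is only a simplified model, not a measurement: it assumes every "arrow" costs one intermediate write of equal size, and it ignores the initial read and the final output. The function names are hypothetical and are not part of Spark or Hadoop.

```python
def mapreduce_arrows(jobs: int) -> int:
    """Arrows in map -> reduce -> map -> reduce ... for `jobs` chained jobs.

    There are 2 * jobs phases, so 2 * jobs - 1 arrows between them.
    """
    return 2 * jobs - 1


def spark_arrows(jobs: int) -> int:
    """Arrows in map -> reduce + map -> ... -> reduce for the same work.

    Each reduce is fused with the following map, leaving jobs + 1 stages
    and therefore `jobs` arrows between them.
    """
    return jobs


# Compare the two models for a few chain lengths.
for jobs in (1, 3, 10, 100):
    mr = mapreduce_arrows(jobs)
    sp = spark_arrows(jobs)
    print(f"{jobs:>3} jobs: MR arrows={mr}, Spark arrows={sp}, ratio={mr / sp:.2f}")
```

For the three-job chain written out in the message this gives 5 arrows against 3, and the ratio approaches 2 as the chain grows, which matches the "about 2 times" estimate under these assumptions.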

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Isn't Spark the same in this regard? You can execute all of the narrow dependencies of a Spark stage in one mapper, thus having the same number of mappers + reducers as Spark stages for the same job, no? thanks, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 4:45 PM,

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
You're right. I suppose I just mean most operations don't need a shuffle - you don't have 10 stages for 10 transformations. Also: caching in memory is another way that memory is used to avoid IO. On Sat, Jul 2, 2022, 8:42 AM krexos wrote: > This doesn't add up with what's described in the
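
The point that 10 transformations do not mean 10 stages can be sketched in plain Python. This is an illustrative model only, assuming a linear chain of RDD operations in which a new stage starts at each operation that needs a shuffle. The `SHUFFLE_OPS` set and `count_stages` helper are hypothetical, and the operations listed only typically shuffle (Spark can skip the shuffle when the data is already partitioned appropriately).

```python
# Operations that typically require a shuffle (wide dependencies).
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}


def count_stages(transformations):
    """Stages in a linear chain: one initial stage plus one per shuffle."""
    return 1 + sum(1 for op in transformations if op in SHUFFLE_OPS)


# A hypothetical job with 10 transformations, only 2 of which shuffle.
pipeline = [
    "map", "filter", "flatMap", "map", "reduceByKey",
    "mapValues", "filter", "map", "sortByKey", "map",
]

print(len(pipeline), "transformations ->", count_stages(pipeline), "stages")
```

In this model the narrow operations between shuffles are pipelined within one stage, so only the two shuffle boundaries produce shuffle files.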

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Don't stages by definition include a shuffle? If you didn't need a shuffle between 2 stages you could merge them into one stage. thanks, krexos --- Original Message --- On Saturday, July 2nd, 2022 at 4:13 PM, Sean Owen wrote: > Because only shuffle stages write shuffle files. Most

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
This doesn't add up with what's described in the internals page I included. What you are talking about is shuffle spills at the beginning of the stage. What I am talking about is that at the end of the stage Spark writes all of the stage's results to shuffle files on disk, thus we will have the

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sean Owen
Because only shuffle stages write shuffle files. Most stages are not shuffles. On Sat, Jul 2, 2022, 7:28 AM krexos wrote: > Hello, > > One of the main "selling points" of Spark is that unlike Hadoop map-reduce > that persists intermediate results of its computation to HDFS (disk), Spark > keeps

Re: How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread Sid
Hi Krexos, If I understand correctly, you are asking how Spark is an advantage over MapReduce if it also involves disk I/O. Basically, each MapReduce phase writes its intermediate results to disk. So on average it involves 6 times disk I/O, whereas Spark (assuming it has an

How is Spark a memory based solution if it writes data to disk before shuffles?

2022-07-02 Thread krexos
Hello, One of the main "selling points" of Spark is that unlike Hadoop map-reduce that persists intermediate results of its computation to HDFS (disk), Spark keeps all its results in memory. I don't understand this as in reality when a Spark stage finishes [it writes all of the data into