Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Muhammad Haseeb Javed
Thanks Andrew for a detailed response,

So the reason why key value pairs with same keys are always found in a
single buckets in Hash based shuffle but not in Sort is because in
sort-shuffle each mapper writes a single partitioned file, and it is up to
the reducer to fetch correct partitions from the the files ?

On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or and...@databricks.com wrote:

 Hi Muhammad,

 On a high level, in hash-based shuffle each mapper M writes R shuffle
 files, one for each reducer where R is the number of reduce partitions.
 This results in M * R shuffle files. Since it is not uncommon for M and R
 to be O(1000), this quickly becomes expensive. An optimization with
 hash-based shuffle is consolidation, where all mappers run in the same core
 C write one file per reducer, resulting in C * R files. This is a strict
 improvement, but it is still relatively expensive.

 Instead, in sort-based shuffle each mapper writes a single partitioned
 file. This allows a particular reducer to request a specific portion of
 each mapper's single output file. In more detail, the mapper first fills up
 an internal buffer in memory and continually spills the contents of the
 buffer to disk, then finally merges all the spilled files together to form
 one final output file. This places much less stress on the file system and
 requires much fewer I/O operations especially on the read side.

 -Andrew



 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk:

 I did check it out and although I did get a general understanding of the
 various classes used to implement Sort and Hash shuffles, however these
 slides lack details as to how they are implemented and why sort generally
 has better performance than hash

 On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran ravikiranmag...@gmail.com
 wrote:

 Have a look at this presentation.
 http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
 of help to you.

 On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk wrote:

 What are the major differences between how Sort based and Hash based
 shuffle operate and what is it that cause Sort Shuffle to perform better
 than Hash?
 Any talks that discuss both shuffles in detail, how they are
 implemented and the performance gains ?







Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a bucket is a single file in hash-based shuffle (no
consolidation), but a segment of partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk:

 Thanks Andrew for a detailed response,

 So the reason why key value pairs with same keys are always found in a
 single buckets in Hash based shuffle but not in Sort is because in
 sort-shuffle each mapper writes a single partitioned file, and it is up to
 the reducer to fetch correct partitions from the the files ?

 On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or and...@databricks.com wrote:

 Hi Muhammad,

 On a high level, in hash-based shuffle each mapper M writes R shuffle
 files, one for each reducer where R is the number of reduce partitions.
 This results in M * R shuffle files. Since it is not uncommon for M and R
 to be O(1000), this quickly becomes expensive. An optimization with
 hash-based shuffle is consolidation, where all mappers run in the same core
 C write one file per reducer, resulting in C * R files. This is a strict
 improvement, but it is still relatively expensive.

 Instead, in sort-based shuffle each mapper writes a single partitioned
 file. This allows a particular reducer to request a specific portion of
 each mapper's single output file. In more detail, the mapper first fills up
 an internal buffer in memory and continually spills the contents of the
 buffer to disk, then finally merges all the spilled files together to form
 one final output file. This places much less stress on the file system and
 requires much fewer I/O operations especially on the read side.

 -Andrew



 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk:

 I did check it out and although I did get a general understanding of the
 various classes used to implement Sort and Hash shuffles, however these
 slides lack details as to how they are implemented and why sort generally
 has better performance than hash

 On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran ravikiranmag...@gmail.com
 wrote:

 Have a look at this presentation.
 http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
 of help to you.

 On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk wrote:

 What are the major differences between how Sort based and Hash based
 shuffle operate and what is it that cause Sort Shuffle to perform better
 than Hash?
 Any talks that discuss both shuffles in detail, how they are
 implemented and the performance gains ?








Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad,

On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive. An optimization with
hash-based shuffle is consolidation, where all mappers run in the same core
C write one file per reducer, resulting in C * R files. This is a strict
improvement, but it is still relatively expensive.

Instead, in sort-based shuffle each mapper writes a single partitioned
file. This allows a particular reducer to request a specific portion of
each mapper's single output file. In more detail, the mapper first fills up
an internal buffer in memory and continually spills the contents of the
buffer to disk, then finally merges all the spilled files together to form
one final output file. This places much less stress on the file system and
requires much fewer I/O operations especially on the read side.

-Andrew



2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk
:

 I did check it out and although I did get a general understanding of the
 various classes used to implement Sort and Hash shuffles, however these
 slides lack details as to how they are implemented and why sort generally
 has better performance than hash

 On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran ravikiranmag...@gmail.com
 wrote:

 Have a look at this presentation.
 http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
 of help to you.

 On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk wrote:

 What are the major differences between how Sort based and Hash based
 shuffle operate and what is it that cause Sort Shuffle to perform better
 than Hash?
 Any talks that discuss both shuffles in detail, how they are implemented
 and the performance gains ?






Re: Difference between Sort based and Hash based shuffle

2015-08-16 Thread Muhammad Haseeb Javed
I did check it out and although I did get a general understanding of the
various classes used to implement Sort and Hash shuffles, however these
slides lack details as to how they are implemented and why sort generally
has better performance than hash

On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran ravikiranmag...@gmail.com
wrote:

 Have a look at this presentation.
 http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
 help to you.

 On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk wrote:

 What are the major differences between how Sort based and Hash based
 shuffle operate and what is it that cause Sort Shuffle to perform better
 than Hash?
 Any talks that discuss both shuffles in detail, how they are implemented
 and the performance gains ?





Re: Difference between Sort based and Hash based shuffle

2015-08-15 Thread Ravi Kiran
Have a look at this presentation.
http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
help to you.

On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
11besemja...@seecs.edu.pk wrote:

 What are the major differences between how Sort based and Hash based
 shuffle operate and what is it that cause Sort Shuffle to perform better
 than Hash?
 Any talks that discuss both shuffles in detail, how they are implemented
 and the performance gains ?



Difference between Sort based and Hash based shuffle

2015-08-15 Thread Muhammad Haseeb Javed
What are the major differences between how Sort based and Hash based
shuffle operate and what is it that cause Sort Shuffle to perform better
than Hash?
Any talks that discuss both shuffles in detail, how they are implemented
and the performance gains ?