Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Muhammad Haseeb Javed
Thanks Andrew for a detailed response, So the reason why key value pairs with same keys are always found in a single buckets in Hash based shuffle but not in Sort is because in sort-shuffle each mapper writes a single partitioned file, and it is up to the reducer to fetch correct partitions from

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a bucket is a single file in hash-based shuffle (no consolidation), but a segment of partitioned file in sort-based shuffle. 2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk: Thanks Andrew for a detailed response, So the reason why key value pairs

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad, On a high level, in hash-based shuffle each mapper M writes R shuffle files, one for each reducer where R is the number of reduce partitions. This results in M * R shuffle files. Since it is not uncommon for M and R to be O(1000), this quickly becomes expensive. An optimization with

Re: Difference between Sort based and Hash based shuffle

2015-08-16 Thread Muhammad Haseeb Javed
I did check it out and although I did get a general understanding of the various classes used to implement Sort and Hash shuffles, however these slides lack details as to how they are implemented and why sort generally has better performance than hash On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran

Re: Difference between Sort based and Hash based shuffle

2015-08-15 Thread Ravi Kiran
Have a look at this presentation. http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of help to you. On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 11besemja...@seecs.edu.pk wrote: What are the major differences between how Sort based and Hash based shuffle operate

Difference between Sort based and Hash based shuffle

2015-08-15 Thread Muhammad Haseeb Javed
What are the major differences between how Sort based and Hash based shuffle operate and what is it that cause Sort Shuffle to perform better than Hash? Any talks that discuss both shuffles in detail, how they are implemented and the performance gains ?