Re: almost sorted map output

Owen O'Malley Mon, 31 May 2010 20:12:14 -0700

On Fri, May 28, 2010 at 10:58 PM, juber patel <[email protected]> wrote:
> Hello,
>
> Can Hadoop take advantage of the fact that the output of each map task
> is almost sorted?


The map output sort is efficient for mostly sorted output. The
dominant cost is the transfer costs of the shuffle and that won't be
helped by almost sorted output. There is an optimization that we have
discussed for years but not implemented where if you have lop-sided
partitions (such as largely sorted data and a total order
partitioner), you schedule the reduce close to the majority of its
data. In some applications that would be a large win.

> On a related note, Does Hadoop's Quicksort implementation give worst
> case performance on almost sorted data? Should I use Heapsort in its
> place?

The quick sort used on the map side is careful about pivots and the
resulting partitions. I wouldn't worry about replacing it.

Re: almost sorted map output

Reply via email to