On Fri, May 28, 2010 at 10:58 PM, juber patel <[email protected]> wrote: > Hello, > > Can Hadoop take advantage of the fact that the output of each map task > is almost sorted?
The map output sort is efficient for mostly sorted output. The dominant cost is the transfer costs of the shuffle and that won't be helped by almost sorted output. There is an optimization that we have discussed for years but not implemented where if you have lop-sided partitions (such as largely sorted data and a total order partitioner), you schedule the reduce close to the majority of its data. In some applications that would be a large win. > On a related note, Does Hadoop's Quicksort implementation give worst > case performance on almost sorted data? Should I use Heapsort in its > place? The quick sort used on the map side is careful about pivots and the resulting partitions. I wouldn't worry about replacing it.
