Comments inline ... On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:
If the output is already sorted, the sort pass *should* be able to run in linear time - perhaps not worth optimizing it out for cases of sorted output.
Agreed. There's no reason why the framework can't detect this and automatically skip the n log(n) sort. Given the scatter/reassemble nature of the default map/reduce
(scatter by Partition, by default by the Hash), inputs that are sorted may not be written as such to output..... so, if you're counting on sorted data, maybe it's best to leave the sort in (and verify that the current infrastructure will perform well given sorted input). Otherwise, if there is no implication/need of sorted output, then sort can be totally disabled.
I guess when I say "sorted" I really mean aggregated. If you know the input is "aggregated" (e.g. it's the output of a previous Map/Reduce job) and the map() function preserves this aggregation, then the step of building the intermediate results from the map output should be done in linear time. I do feel that this optimization is important. As Adrezej points out, there are quite a few applications that could benefit from this. In particular, the join example. Without it, you can't really justify replacing your large RDBMS system with a Map-Reduce (Bigtable) implementation. - Doug
