Comments inline ...

On 1/25/07, Bryan A. P. Pendleton <[EMAIL PROTECTED]> wrote:

If the output is already sorted, the sort pass *should* be able to run in
linear time - perhaps not worth optimizing it out for cases of sorted
output.


Agreed.  There's no reason why the framework can't detect this and
automatically skip the n log(n) sort.

Given the scatter/reassemble nature of the default map/reduce
(scatter by Partition, by default by the Hash), inputs that are sorted may
not be written as such to output..... so, if you're counting on sorted
data,
maybe it's best to leave the sort in (and verify that the current
infrastructure will perform well given sorted input). Otherwise, if there
is
no implication/need of sorted output, then sort can be totally disabled.


I guess when I say "sorted" I really mean aggregated.  If you know the input
is "aggregated" (e.g. it's the output of a previous Map/Reduce job) and the
map() function preserves this aggregation, then the step of building the
intermediate results from the map output should be done in linear time.

I do feel that this optimization is important.  As Adrezej points out, there
are quite a few applications that could benefit from this.  In particular,
the join example.  Without it, you can't really justify replacing your large
RDBMS system with a Map-Reduce (Bigtable) implementation.

- Doug

Reply via email to