After digging into this a bit, it looks like the use of IdentityReducer does not disable the sort. I wrote a simple Map/Reduce program that uses /usr/share/dict/words as input and generates keys that are a Text representation of the CRC of the word modulo 65536 and values that are the word itself. I set the reducer to be the IdentityReducer and the output came out sorted:
0 apperceptively 0 Connarus 1 overfold 1 derationalization 1 gymnasium 10 respecting 10 supperwards 100 cellulofibrous 100 drogherman 100 heteroptics 1000 bacao 1000 Cumaean 1000 didymate 1000 disbelieving 1001 polymer 1001 salveline 1001 workwomanly 1002 sporty 1002 bakal 1003 preferentialist Also, after reviewing the Google paper, they make no mention of the sort being disabled by the Identity reducer. In fact, they describe their Sort implementation as using the identity reducer. Unless I'm missing something, I retract my previous statement. Map-Reduce is really just distributed sort. I do think that being able to disable the sort is a much needed enhancement, especially since quite a few applications don't need it. - Doug On 1/24/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Doug Judd wrote: > Part of the problem is that calling the paradigm "Map-Reduce" is somewhat > misleading. It is really just a distributed sort. The sort is where > all of > the complexity comes from. Invoking map() over the input is O(n), > invoking > reduce() over the intermediate results is O(n) as well. The sort is > O(nlogn). A more appropriate name for this algorithm would be > "Distributed > Sort with a Pre-map Phase and a Post-reduce Phase" Calling it Map-Reduce > and leaving out the word "sort" (the most important part) is a source of > confusion. > > If you think of it in these terms, I think it's easier to see where > and how > it applies. :) Sure, that's one point of view on this - however, in quite a few applications sort is definitely less important than the ability to split the processing load in map() and reduce() over many machines. Sometimes I don't care about the sorting at all (in all cases where IdentityReducer is used). -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
