On Wed, Jan 24, 2007 at 05:32:20PM -0800, Doug Judd wrote:
> After digging into this a bit, it looks like the use of IdentityReducer does
> not disable the sort. I wrote a simple Map/Reduce program that uses
> /usr/share/dict/words as input and generates keys that are a Text
> representation of the CRC of the word modulo 65536 and values that are the
> word itself. I set the reducer to be the IdentityReducer and the output
> came out sorted:
>

It doesn't disable the sort, but Andrzej's comment still holds:

> >:) Sure, that's one point of view on this - however, in quite a few
> >applications sort is definitely less important than the ability to
> >split the processing load in map() and reduce() over many machines.
> >Sometimes I don't care about the sorting at all (in all cases where
> >IdentityReducer is used).

When you say "MapReduce is just distributed sort," that makes it sound
like people use MapReduce because they want a distributed sort. The fact
of the matter is that there are plenty of other reasons to use
MapReduce, including load balancing, fault tolerance, etc. In the
majority of the cases where a sort is needed, it's really just an
implementation detail; if the majority of the work being done is in the
map function, it doesn't make sense to put so much importance on the
sort that follows it.

In general, I don't agree with your statement. However, if you need to
sort a large number of items, running a MapReduce job with identity map
and identity reduce would be a very simple way to do it.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
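
For anyone who wants to see the behavior discussed in this thread without a
cluster, here is a rough in-process sketch in Python. It is not Hadoop code
and does not use the Hadoop API; run_job, crc_map and identity_reduce are
hypothetical names made up for illustration, and zlib.crc32 is assumed as a
stand-in for whatever CRC the original test program used. It only shows that
the sort sits between map and reduce, so an identity reducer still sees (and
emits) keys in sorted order.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def crc_map(word):
    """Hypothetical mapper: key is the text form of CRC32(word) mod 65536."""
    key = str(zlib.crc32(word.encode("utf-8")) % 65536)
    yield key, word


def identity_reduce(key, values):
    """Pass every (key, value) pair through unchanged."""
    for value in values:
        yield key, value


def run_job(records, mapper, reducer):
    """Toy single-process stand-in for a map / sort / reduce pipeline."""
    # Map phase: apply the mapper to every input record.
    pairs = [kv for record in records for kv in mapper(record)]
    # Sort phase: happens here no matter which reducer is plugged in.
    # Keys are strings, so the order is lexicographic, not numeric.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: hand each key and its values to the reducer.
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


if __name__ == "__main__":
    sample = ["apple", "banana", "cherry", "date", "elderberry"]
    for key, word in run_job(sample, crc_map, identity_reduce):
        print(key, word)
```

The reducer does no work here, yet the output is ordered by key, which is
the point above: the sort is part of the framework's plumbing rather than
the reason for running the job.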
