Thank you very much -- this is exactly what I need!

Chris
On 10/2/07, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Oct 1, 2007, at 6:05 PM, Chris Dyer wrote:
>
> > As of right now, I'm still having trouble determining how I can force
> > the first element of the set that will be iterated over by a single
> > reducer to be the marginal, and not some individual count. Does
> > anyone know if Hadoop guarantees (or can be made to guarantee) that the
> > relative order of keys that are equal will be left unchanged? If so,
> > this would be a fairly easy solution.
>
> There is no guarantee that the reduce sort is stable in any
> sense. (With the non-deterministic order in which the map outputs become
> available to the reduce, it wouldn't make much sense.)
>
> There certainly isn't enough documentation about what is allowed for
> sorting. I've filed a bug, HADOOP-1981, to expand the Reducer javadoc
> to mention the JobConf methods that can control the sort order. In
> particular, the methods are:
>
>   setOutputKeyComparatorClass
>   setOutputValueGroupingComparator
>
> The first comparator controls the sort order of the keys. The second
> controls which keys are grouped together into a single call to the
> reduce method. The combination of these two allows you to set up jobs
> that act as if you had defined an order on the values.
>
> For example, say that you want to find duplicate web pages and tag
> them all with the url of the "best" known example. You would set up
> the job like:
>
>   Map Input Key: url
>   Map Input Value: document
>   Map Output Key: document checksum, url pagerank
>   Map Output Value: url
>   Partitioner: by checksum
>   OutputKeyComparator: by checksum and then decreasing pagerank
>   OutputValueGroupingComparator: by checksum
>
> With this setup, the reduce function will be called exactly once for
> each checksum, but the first value from the iterator will be the one
> with the highest pagerank, which can then be used to tag the other
> entries of the checksum family.
>
> -- Owen
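
For anyone finding this thread later, here is roughly how I read Owen's setup in code. This is an untested sketch against the old mapred JobConf API he names; the class names (ChecksumRankKey, KeyComparator, GroupComparator, ChecksumPartitioner) are mine, not from his mail:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Composite map output key: document checksum plus the url's pagerank.
    public class ChecksumRankKey implements WritableComparable<ChecksumRankKey> {
      public Text checksum = new Text();
      public double pagerank;

      public void write(DataOutput out) throws IOException {
        checksum.write(out);
        out.writeDouble(pagerank);
      }

      public void readFields(DataInput in) throws IOException {
        checksum.readFields(in);
        pagerank = in.readDouble();
      }

      // Natural order: checksum ascending, then pagerank descending.
      public int compareTo(ChecksumRankKey o) {
        int c = checksum.compareTo(o.checksum);
        return c != 0 ? c : -Double.compare(pagerank, o.pagerank);
      }
    }

    // OutputKeyComparator: sorts the full composite key (checksum, then
    // decreasing pagerank), so the best url arrives first at the reducer.
    class KeyComparator extends WritableComparator {
      public KeyComparator() { super(ChecksumRankKey.class, true); }
      public int compare(WritableComparable a, WritableComparable b) {
        return ((ChecksumRankKey) a).compareTo((ChecksumRankKey) b);
      }
    }

    // OutputValueGroupingComparator: compares by checksum only, so all urls
    // sharing a checksum fall into a single reduce() call.
    class GroupComparator extends WritableComparator {
      public GroupComparator() { super(ChecksumRankKey.class, true); }
      public int compare(WritableComparable a, WritableComparable b) {
        return ((ChecksumRankKey) a).checksum.compareTo(((ChecksumRankKey) b).checksum);
      }
    }

    // Partitioner: every key with the same checksum goes to the same reducer.
    class ChecksumPartitioner implements Partitioner<ChecksumRankKey, Text> {
      public void configure(JobConf job) {}
      public int getPartition(ChecksumRankKey key, Text value, int numPartitions) {
        return (key.checksum.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

and then wiring it into the job:

    JobConf conf = new JobConf();
    conf.setMapOutputKeyClass(ChecksumRankKey.class);
    conf.setMapOutputValueClass(Text.class);
    conf.setPartitionerClass(ChecksumPartitioner.class);
    conf.setOutputKeyComparatorClass(KeyComparator.class);        // sort order of the keys
    conf.setOutputValueGroupingComparator(GroupComparator.class); // which keys share a reduce call

For my marginal-before-counts problem the same pattern should apply: put a flag in the composite key that sorts the marginal ahead of the individual counts, and partition and group on the conditioning event only.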