I didn't find any "secondary value sorter" either. One workaround, if both your cache and your input are too large to hold in memory, is to join them with CompositeInputFormat. Note that both data sets must first be sorted and partitioned identically before the join.
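
To illustrate why the sorting matters: once both sides are in the same order, a merge-style pass only ever needs the current head of each stream. Below is a minimal sketch in plain Python, not Hadoop code. The function name `assign_new_ids` and the integer id scheme are made up for illustration, and it assumes both inputs are sorted with the same comparison.

```python
from typing import Iterable, Iterator, Tuple


def assign_new_ids(sorted_values: Iterable[str],
                   sorted_known: Iterable[str],
                   next_id: int) -> Iterator[Tuple[str, int]]:
    """Yield (value, new_id) for each distinct value missing from sorted_known.

    Hypothetical helper: both inputs must be sorted ascending with the
    same ordering. Only one element of each stream is held at a time.
    """
    known_iter = iter(sorted_known)
    known = next(known_iter, None)  # current head of the known-values stream
    previous = None                 # last input value seen, to skip duplicates

    for value in sorted_values:
        if value == previous:
            continue
        previous = value

        # Advance the known stream until it catches up with the input value.
        while known is not None and known < value:
            known = next(known_iter, None)

        # No match at the head of the known stream: this value is new.
        if known != value:
            yield value, next_id
            next_id += 1


if __name__ == "__main__":
    values = ["apple", "apple", "cherry", "kiwi", "pear"]
    known = ["banana", "cherry", "pear"]
    for value, new_id in assign_new_ids(values, known, 100):
        print(value, new_id)
```

In a real job the two streams would be the two sides of the join rather than in-memory lists, and the ordering would be whatever comparator the keys use.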
There is another join utility in contrib/data_join, but it is not as performant as a map-side join.

Regards
Mice

2008/10/29 Mark Tozzi <[EMAIL PROTECTED]>:
> Greetings Hadoop users,
>
> I'm relatively new to MapReduce (I've been working on my own with the Hadoop
> code for about a month and a half now), and I'm having difficulty with how
> the values for a given key are passed to the reducer.
>
> As per the API, the reducer expects a single Key and an iterator over a
> collection of values. Is there any way to specify that the iterator (or the
> underlying collection, perhaps) be sorted? At the moment, the values seem
> to be in random order. It seems like something to do with the sequence file
> the mapper writes its intermediate output to could sort this, but I can't
> find anything in the documentation that illuminates how to do this.
>
> To clarify, within each key (which I realize are already sorted and grouped
> to be handed to the reducers), I would like the values list sorted relative
> to itself.
>
> The application I am working on is basically a normalizer for a large set of
> data files. My mapper initially breaks out a line of the input into a bunch
> of (FieldName, FieldValue) pairs. Each reducer then operates on a single
> FieldName (and thus is assured of having all values for that field), and
> compares this against a (sorted) list from the distributed cache of known
> values for that field. If a field value is not in the cache, a new id is
> generated and the value-id pair is sent to the output. For small data sets,
> I can load the whole cache into a HashSet or similar, but for large data
> sets that is not practical. If both the input list and the cache were
> sorted, I would only ever need to keep the top value of each in memory - a
> huge efficiency gain.
>
> Thanks in advance for any assistance you can provide.
>
> --
> Mark Tozzi
> Developer - Business Systems Team
> About
> The Answer is...About.com
> www.about.com
> ph: 212-204-2863 fax: 212-204-1684
> aim: markatabout
> About.com is part of The New York Times Company
>