I couldn't find any built-in "secondary value sorter" either.

One workaround, if both your cache and your input are too large to
hold in memory, is to join them with CompositeInputFormat. Both data
sets must first be sorted by the same key and partitioned identically
before the join.
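
To show why both sides need to be sorted, here is a minimal plain-Java
sketch of the merge step, independent of Hadoop. The class and method
names (SortedMergeSketch, findUnknown) are made up for illustration and
are not part of any Hadoop API. It assumes both iterators yield strings
in ascending natural order, and it keeps only the current element of
each side in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: not part of Hadoop, only illustrates the merge idea.
final class SortedMergeSketch {

    // Returns the distinct input values that do not appear in the known list.
    // Assumption: both iterators are sorted in ascending natural String order.
    static List<String> findUnknown(Iterator<String> input, Iterator<String> known) {
        List<String> unknown = new ArrayList<>();
        String currentKnown = known.hasNext() ? known.next() : null;
        String previousInput = null;

        while (input.hasNext()) {
            String value = input.next();
            // Sorted input puts duplicates next to each other, so skip repeats.
            if (value.equals(previousInput)) {
                continue;
            }
            previousInput = value;

            // Advance the known list until it catches up with the input value.
            while (currentKnown != null && currentKnown.compareTo(value) < 0) {
                currentKnown = known.hasNext() ? known.next() : null;
            }

            // No match at the current position means the value is new.
            if (currentKnown == null || !currentKnown.equals(value)) {
                unknown.add(value);
            }
        }
        return unknown;
    }
}
```

If either side were unsorted, the inner loop could run past a matching
entry and wrongly report the value as new, which is why the sort has to
happen before the join.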

There is also a join utility in contrib/data_join, but it does not
perform as well as a map-side join.

Regards
Mice

2008/10/29 Mark Tozzi <[EMAIL PROTECTED]>:
> Greetings Hadoop users,
>
> I'm relatively new to MapReduce (I've been working on my own with the Hadoop
> code for about a month and a half now), and I'm having difficulty with how
> the values for a given key are passed to the reducer.
>
> As per the API, the reducer expects a single Key and an iterator over a
> collection of values.  Is there any way to specify that the iterator (or the
> underlying collection, perhaps) be sorted?  At the moment, the values seem
> to be in random order.  It seems like something to do with the sequence file
> the mapper writes its intermediate output to could sort this, but I can't
> find anything in the documentation that illuminates how to do this.
>
> To clarify, within each key (which I realize are already sorted and grouped
> to be handed to the reducers), I would like the values list sorted relative
> to itself.
>
> The application I am working on is basically a normalizer for a large set of
> data files.  My mapper initially breaks out a line of the input into a bunch
> of (FieldName, FieldValue) pairs.  Each reducer then operates on a single
> FieldName (and thus is assured of having all values for that field), and
> compares this against a (sorted) list from the distributed cache of known
> values for that field.  If a field value is not in the cache, a new id is
> generated and the value-id pair is sent to the output.  For small data sets,
> I can load the whole cache into a HashSet or similar, but for large data
> sets that is not practical.  If both the input list and the cache were
> sorted, I would only ever need to keep the top value of each in memory - a
> huge efficiency gain.
>
> Thanks in advance for any assistance you can provide.
>
> --
> Mark Tozzi
> Developer - Business Systems Team
> About
> The Answer is...About.com
> www.about.com
> ph: 212-204-2863 fax: 212-204-1684
> aim: markatabout
> About.com is part of The New York Times Company
>
