ok - got it. This seems to be a subtle drawback in the reduce api. Keys
in the same reduce group may differ - but the reduce api gives the
reducer function no access to each individual key. It only sees the key
of the first record in that group.

If the api was instead:

class KVpair { WritableComparable key; Writable value; }
reduce(WritableComparable groupKey, Iterator<KVpair> keyvalues)

then we would be in good shape (since we can see the key and the value
and don't have to duplicate any data across them).

The underlying iterator has access to this data - it's just not
available through the api.
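A fuller sketch of what that proposal might look like, in plain Java -
to be clear, none of these types or method names exist in Hadoop; the
KVpair class, the composite-key strings, and the demo reducer are all
hypothetical, only meant to show what the reducer would gain:

```java
import java.util.*;

// Hypothetical sketch of the proposed reduce api -- not the real
// Hadoop interface. Keys/values are plain Strings standing in for
// WritableComparable/Writable.
public class KVpairSketch {
    static class KVpair {
        final String key;   // the full per-record key, not just the group key
        final String value;
        KVpair(String key, String value) { this.key = key; this.value = value; }
    }

    // reduce() as proposed: the group key plus an iterator of (key, value)
    // pairs, so per-record keys need not be duplicated into the values.
    static List<String> reduce(String groupKey, Iterator<KVpair> keyvalues) {
        List<String> out = new ArrayList<>();
        while (keyvalues.hasNext()) {
            KVpair kv = keyvalues.next();
            out.add(groupKey + " saw key=" + kv.key + " value=" + kv.value);
        }
        return out;
    }

    public static void main(String[] args) {
        // "a#*" etc. are made-up composite keys for the demo.
        Iterator<KVpair> it = Arrays.asList(
            new KVpair("a#*", "3"),
            new KVpair("a#b_1", "2"),
            new KVpair("a#b_2", "1")).iterator();
        for (String line : reduce("a", it)) System.out.println(line);
    }
}
```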

I suspect that these kinds of small optimizations are too complex to
make for a one-off job - but for any query language on top of hadoop,
it's a one-time effort and probably worth it.

joydeep

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value




On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which
> needs 
>> sorting) to the key.
> 
> Why? You can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b_1, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1

If you use the first two fields as the key so that you can sort the
records nicely (and a combiner sums the counts per key), you get the
following input to the reducer:

   <a, *>, [3, 2, 1]

You now don't know which values the counts belong to, except for the
first one.  If you replicate the second field in the value output of
the map, then you get this:

   <a, *>, [[*, 3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3
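The reduce step above can be sketched as a plain-Java simulation (no
Hadoop involved; the class and method names are mine, and the combined,
sorted records are hard-coded rather than produced by a real shuffle).
It shows why the replicated second field lets the reducer compute the
fractions:

```java
// Plain-Java simulation of the reducer's view of the example group:
// the combiner has summed counts per composite key, and the sort puts
// "*" before "b_1"/"b_2", so the group total arrives first.
public class SortByValueDemo {
    static String reduceGroup() {
        // {second field, count} -- the second field is replicated into
        // the value because the reduce api would otherwise hide it.
        String[][] values = { {"*", "3"}, {"b_1", "2"}, {"b_2", "1"} };
        long total = 0;
        StringBuilder out = new StringBuilder();
        for (String[] v : values) {
            if (v[0].equals("*")) {
                total = Long.parseLong(v[1]);   // group denominator
            } else {
                out.append("a, ").append(v[0]).append(", ")
                   .append(v[1]).append("/").append(total).append("\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(reduceGroup());
        // prints:
        // a, b_1, 2/3
        // a, b_2, 1/3
    }
}
```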
