On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

> 
>> But it actually adds duplicate data (i.e., the value column which needs
>> sorting) to the key.
> 
> Why? U can always take it out of the value to remove the redundancy.
> 

Actually, you can't in most cases.

Suppose you have input data like this:

   a, b_1
   a, b_2
   a, b_1

And then the mapper produces data like this for each input record:

   a, b_1, 1
   a, *, 1
   a, b_2, 1
   a, *, 1
   a, b_1, 1
   a, *, 1
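
As a rough sketch of that map step in plain Python (this is not Hadoop API code, and the name map_record is made up for illustration), each input record emits one count for its own pair and one for the wildcard total:

```python
def map_record(a, b):
    """Emit a count for the specific (a, b) pair and one for the (a, *) total."""
    yield (a, b), 1
    yield (a, "*"), 1


# The three input records from the example above.
records = [("a", "b_1"), ("a", "b_2"), ("a", "b_1")]

emitted = [kv for a, b in records for kv in map_record(a, b)]

for (a, b), count in emitted:
    print(f"{a}, {b}, {count}")
```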

If you use the first two fields as the key so that you can sort the records
nicely, you get the following input to the reducer (after the counts for each
key have been combined):

   <a, *>, [3, 2, 1]

You now don't know which value each count belongs to, except for the first
one.  If you replicate the second field in the value output of the map, then
you get this:

   <a, *>, [[*,3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

   a, b_1, 2/3
   a, b_2, 1/3
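
A minimal sketch of such a reducer in plain Python, assuming the values arrive as (second field, count) pairs with the wildcard total sorted first, as in the list above (the name reduce_fractions is hypothetical and this is not Hadoop API code):

```python
from fractions import Fraction


def reduce_fractions(key, values):
    """Turn [(tag, count), ...] into (a, tag, count/total) rows.

    Assumes the first value carries the '*' tag and the group total.
    """
    first_field = key[0]
    values = iter(values)
    tag, total = next(values)
    if tag != "*":
        raise ValueError("expected the '*' total to be sorted first")
    for tag, count in values:
        yield first_field, tag, Fraction(count, total)


rows = list(reduce_fractions(("a", "*"), [("*", 3), ("b_1", 2), ("b_2", 1)]))

for a, b, frac in rows:
    print(f"{a}, {b}, {frac}")
```

Without the second field replicated in each value, the loop would have nothing to label the counts 2 and 1 with.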
