Hi Shahab,

I'm not sure I understand right, but the problem is that you need to put the data you want to secondary-sort into your key class. What I just realized, though, is that the original key probably IS accessible, because of the Writable semantics: as you iterate through the Iterable passed to the reduce call, the key changes its contents. Am I right? This seems a bit weird, but it is probably how it works. I overlooked it because of the way the API looks and how all the howtos on secondary sort look. All of them duplicate the secondary part of the key in the value.
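To illustrate the point about Writable semantics, here is a minimal plain-Java sketch (no Hadoop dependency) of what the framework does: it deserializes every record into the SAME key instance via readFields(), so the key object mutates as the values Iterable advances. The CompositeKey class and the (base, extension) field names are illustrative assumptions, not real Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical composite key: 'base' drives partitioning and grouping,
// 'extension' drives the secondary sort. Names are illustrative only.
class CompositeKey {
    String base;
    int extension;

    // Hadoop deserializes into the SAME instance via readFields(DataInput);
    // this method mimics that in-place mutation.
    void set(String base, int extension) {
        this.base = base;
        this.extension = extension;
    }
}

public class KeyReuseDemo {
    static List<Integer> run() {
        // Records of one reduce group, already sorted by (base, extension).
        String[][] records = { {"a", "1"}, {"a", "2"}, {"a", "3"} };

        // The framework hands the reducer ONE key object and mutates it
        // each time the values iterator advances.
        CompositeKey key = new CompositeKey();
        List<Integer> extensionsSeen = new ArrayList<>();
        for (String[] rec : records) {
            key.set(rec[0], Integer.parseInt(rec[1])); // in-place, no new object
            extensionsSeen.add(key.extension);          // full key visible per value
        }
        return extensionsSeen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [1, 2, 3]
    }
}
```

If this matches Hadoop's behaviour, the reducer can read the 'extension' for each value straight from the (mutating) key, with no need to duplicate it in the value.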
Jan

-------- Original message --------
Subject: Re: Partitioner vs GroupComparator
From: Shahab Yunus <shahab.yu...@gmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
CC:

@Jan, why not avoid sending the 'hidden' part of the key as the value? You could pass the value as null, or with some other payload. Then on the reducer side there is no duplication, and you can extract the 'hidden' part of the key yourself (which should be possible, as you will be encapsulating it in some class/object model...?)

Regards,
Shahab

On Fri, Aug 23, 2013 at 12:22 PM, Jan Lukavský <jan.lukav...@firma.seznam.cz> wrote:

Hi all,

speaking of this, has anyone ever measured how much more data needs to be transferred over the network when using the GroupingComparator the way Harsh suggests? What I mean is: when you use the GroupingComparator, it hides the real key that you emitted from the Mapper. You only see the first key in the reduce group, so any data carried in the key needs to be duplicated in the value in order to be accessible on the reduce end.

Let's say you have a key consisting of two parts (base, extension); you partition by the 'base' part and use the GroupingComparator to group keys with the same base part. Then you have no choice but to emit from the Mapper something like (key: (base, extension), value: extension), which means the 'extension' part is duplicated in the data that has to be transferred over the network. This overhead can be diminished by using compression between the map and reduce sides, but I believe that in some cases it can be significant.

It would be nice if the API allowed access to the 'real' key for each value, not only the first key of the reduce group. The only
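For readers following along, here is a minimal plain-Java sketch (no Hadoop dependency) of the grouping behaviour Jan describes: records sorted by (base, extension) are grouped for reduce() by comparing the 'base' part only, as a grouping comparator would, and only the first key of each group is nominally visible. The {base, extension, value} record layout, with the extension duplicated into the value slot as in Jan's workaround, is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupingDemo {
    // Mimics job.setGroupingComparatorClass(...): consecutive sorted records
    // whose keys share the same 'base' land in one reduce() call.
    // Each record is {base, extension, value}.
    static List<List<String[]>> group(List<String[]> sorted) {
        List<List<String[]>> groups = new ArrayList<>();
        for (String[] rec : sorted) {
            if (groups.isEmpty()
                    || !groups.get(groups.size() - 1).get(0)[0].equals(rec[0])) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(rec);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Jan's workaround: the extension sits in the key (slot 1) AND is
        // duplicated in the value (slot 2), doubling that field on the wire.
        List<String[]> sorted = Arrays.asList(
                new String[]{"a", "1", "1"},
                new String[]{"a", "2", "2"},
                new String[]{"b", "1", "1"});
        List<List<String[]>> groups = group(sorted);
        System.out.println(groups.size()); // 2 reduce() calls: base "a", base "b"
    }
}
```

The duplicated slot is exactly the per-record overhead Jan asks about; whether it matters in practice depends on how large the extension is relative to the value, and on map-output compression.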