Re: I keep getting multiple values for unique reduce keys

Rick Ross Mon, 05 Sep 2011 22:11:56 -0700

I'm still poking around on this, and I was wondering whether there is a way to 
see the intermediate files that the mapper writes and the ones that the reducer 
reads. I might get some clues in there.


Thanks

R

On Sep 4, 2011, at 10:14 PM, Rick Ross wrote:

> Thanks, but unless I misread you, that didn't do it. Naturally, the object 
> that I am creating just has a couple of ArrayLists to gather up Name and Type 
> objects.
> 
> I suspect I need to extend ArrayWritable instead. I'll try that next.
> 
> Cheers.
> 
> R
> 
> On Sep 4, 2011, at 9:37 PM, Sudharsan Sampath wrote:
> 
>> Hi,
>> 
>> I suspect it's something to do with your custom Writable. Do you have a 
>> clear method on your container? If so, it should be called every time before 
>> the object is populated, to avoid retaining previous values due to object 
>> reuse during the ser-de process.
>> 
>> Thanks
>> Sudhan S
>> 
>> 
>> 
>> On Mon, Sep 5, 2011 at 6:11 AM, Rick Ross <r...@semanticresearch.com> wrote:
>> Hi all,
>> 
>> I have ensured that my mapper produces a unique key for every value it 
>> writes and, furthermore, that each map() call writes only one value. Note 
>> that the value is a custom object for which I implement the Writable 
>> interface methods.
>> 
>> I realize that it isn't very real world to have (well, want) no combining 
>> done prior to reducing, but I'm still getting my feet wet.
>> 
>> When the reducer runs, I expected to see one reduce() call for every map() 
>> call, and I do.    However, the value I get is the composite of all the 
>> reduce() calls that came before it.
>> 
>> So, for example, the mapper gets data like this :
>> 
>>   ID,     Name,          Type,          Other stuff...
>>   A000,   Cream,         Group,         ...
>>   B231,   Led Zeppelin,  Group,         ...
>>   A044,   Liberace,      Individual,    ...
>> 
>> 
>> ID is the external key from the source data and is guaranteed to be unique.
>> 
>> When I map it, I create a container for the row data and output that 
>> container with all the data from that row only and use the ID field as a key.
>> 
>> Since the key is always unique I expected the sort/shuffle step to never 
>> coalesce any two values.    So I expected my reduce() method to be called 
>> once per mapped input row, and it is.
>> 
>> The problem is, as each row is processed, the reducer sees a set of 
>> cumulative value data instead of a container with a row of data in it.  So 
>> the 'value' parameter to reduce always has the information from previous 
>> reduce steps.
>> 
>> For example, given the data above :
>> 
>> 1st Reducer Call :
>>   Key = A000
>>   Value =
>>       Container :
>>          (object 1) : Name = Cream, Type = Group, MBID = A000, ...
>> 
>> 2nd Reducer Call :
>>   Key = B231
>>   Value =
>>       Container :
>>          (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
>>          (object 2) : Name = Cream, Type = Group, MBID = A000, ...
>> 
>> So the second reduce call has data in it from the first reduce call.   Very 
>> strange!   At a guess I would say the reducer is re-using the object when it 
>> reads the objects back from the mapping step.  I dunno..
>> 
>> If anyone has any ideas, I'm open to suggestions. I'm running 
>> 0.20.2-cdh3u0.
>> 
>> Thanks!
>> 
>> R
>> 
>> 
>> 
>> 
>
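
A minimal, Hadoop-free sketch of the object-reuse behaviour discussed in this
thread. It assumes, as Sudharsan suggests, that the framework deserializes
successive records into the same value instance. RowContainer, ReuseDemo, and
the clearOnRead flag are hypothetical names made up for illustration; write()
and readFields() only mirror the shape of Hadoop's Writable interface and are
not the poster's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical container: two ArrayLists gathering names and types.
// write/readFields mirror the shape of Hadoop's Writable interface,
// but nothing here depends on Hadoop.
class RowContainer {
    final List<String> names = new ArrayList<String>();
    final List<String> types = new ArrayList<String>();
    private final boolean clearOnRead; // false reproduces the symptom

    RowContainer(boolean clearOnRead) {
        this.clearOnRead = clearOnRead;
    }

    void add(String name, String type) {
        names.add(name);
        types.add(type);
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            out.writeUTF(names.get(i));
            out.writeUTF(types.get(i));
        }
    }

    public void readFields(DataInput in) throws IOException {
        if (clearOnRead) {
            // The instance may be reused, so drop the previous record first.
            names.clear();
            types.clear();
        }
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            names.add(in.readUTF());
            types.add(in.readUTF());
        }
    }
}

class ReuseDemo {
    // Serializes a one-row container to bytes.
    static byte[] serialize(String name, String type) throws IOException {
        RowContainer row = new RowContainer(true);
        row.add(name, type);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        row.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    // Reads two records into ONE instance (assumed framework behaviour)
    // and returns the names held after the second read.
    static List<String> readTwice(boolean clearOnRead) throws IOException {
        RowContainer reused = new RowContainer(clearOnRead);
        reused.readFields(new DataInputStream(
                new ByteArrayInputStream(serialize("Cream", "Group"))));
        reused.readFields(new DataInputStream(
                new ByteArrayInputStream(serialize("Led Zeppelin", "Group"))));
        return new ArrayList<String>(reused.names);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without clear: " + readTwice(false));
        System.out.println("with clear:    " + readTwice(true));
    }
}
```

Running ReuseDemo prints "without clear: [Cream, Led Zeppelin]" and then
"with clear:    [Led Zeppelin]". The first line is the cumulative value
described in the original post; the second shows that resetting the lists at
the top of readFields() leaves only the current record.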
