I'm still poking around on this and I was wondering if there is a way to see the intermediate files that the mapper writes and the ones that the reducer reads. I might get some clues in there.

Thanks

R

On Sep 4, 2011, at 10:14 PM, Rick Ross wrote:

> Thanks, but unless I misread you, that didn't do it. Naturally the object
> that I am creating just has a couple of ArrayLists to gather up Name and Type
> objects.
>
> I suspect I need to extend ArrayWritable instead. I'll try that next.
>
> Cheers.
>
> R
>
> On Sep 4, 2011, at 9:37 PM, Sudharsan Sampath wrote:
>
>> Hi,
>>
>> I suspect it's something to do with your custom Writable. Do you have a
>> clear method on your container? If so, that should be used before the obj is
>> initialized every time to avoid retaining previous values due to object
>> reuse during ser-de process.
>>
>> Thanks
>> Sudhan S
>>
>> On Mon, Sep 5, 2011 at 6:11 AM, Rick Ross <r...@semanticresearch.com> wrote:
>> Hi all,
>>
>> I have ensured that my mapper produces a unique key for every value it
>> writes and further more that each map() call only writes one value. I
>> note here that the value is a custom for which I handle the Writable
>> interface methods.
>>
>> I realize that it isn't very real world to have (well, want) no combining
>> done prior to reducing, but I'm still getting my feet wet.
>>
>> When the reducer runs, I expected to see one reduce() call for every map()
>> call, and I do. However, the value I get is the composite of all the
>> reduce() calls that came before it.
>>
>> So, for example, the mapper gets data like this :
>>
>> ID, Name, Type, Other stuff...
>> A000, Cream, Group, ...
>> B231, Led Zeppelin, Group, ...
>> A044, Liberace, Individual, ...
>>
>> ID is the external key from the source data and is guaranteed to be unique.
>>
>> When I map it, I create a container for the row data and output that
>> container with all the data from that row only and use the ID field as a key.
>>
>> Since the key is always unique I expected the sort/shuffle step to never
>> coalesce any two values. So I expected my reduce() method to be called
>> once per mapped input row, and it is.
>>
>> The problem is, as each row is processed, the reducer sees a set of
>> cumulative value data instead of a container with a row of data in it. So
>> the 'value' parameter to reduce always has the information from previous
>> reduce steps.
>>
>> For example, given the data above :
>>
>> 1st Reducer Call :
>>   Key = A000
>>   Value =
>>     Container :
>>       (object 1) : Name = Cream, Type = Group, MBID = A000, ...
>>
>> 2nd Reducer Call :
>>   Key = B231
>>   Value =
>>     Container :
>>       (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
>>       (object 2) : Name = Cream, Type = Group, MBID = A000, ...
>>
>> So the second reduce call has data in it from the first reduce call. Very
>> strange! At a guess I would say the reducer is re-using the object when it
>> reads the objects back from the mapping step. I dunno..
>>
>> If anyone has any ideas, I'm open to suggestions.
>>
>> 0.20.2-cdh3u0
>>
>> Thanks!
>>
>> R
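
For reference, here is a minimal sketch of the "clear before reading" idea from Sudhan's reply. It is not the container from the job above: RowContainer and ReuseDemo are hypothetical names, and the class only mimics the two methods that org.apache.hadoop.io.Writable declares (write and readFields), so it runs without Hadoop on the classpath. The thread does not establish that this is the cause of the behaviour described. The sketch only shows why a reused instance accumulates data when readFields() appends to its lists without resetting them first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the row container. In a real job this would
// implement org.apache.hadoop.io.Writable, which declares these two methods.
class RowContainer {
    final List<String> names = new ArrayList<>();
    final List<String> types = new ArrayList<>();

    void add(String name, String type) {
        names.add(name);
        types.add(type);
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            out.writeUTF(names.get(i));
            out.writeUTF(types.get(i));
        }
    }

    void readFields(DataInput in) throws IOException {
        // Reset state first: the caller may pass an instance that still
        // holds the previous record. Without these two calls the lists
        // would keep growing from one record to the next.
        names.clear();
        types.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            names.add(in.readUTF());
            types.add(in.readUTF());
        }
    }
}

class ReuseDemo {
    // Serializes a single-row container to bytes.
    static byte[] serialize(String name, String type) {
        try {
            RowContainer row = new RowContainer();
            row.add(name, type);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            row.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserializes into an existing instance, the way a reused value object would be filled.
    static void deserializeInto(RowContainer target, byte[] data) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] first = serialize("Cream", "Group");
        byte[] second = serialize("Led Zeppelin", "Group");

        // One instance reused for both records.
        RowContainer reused = new RowContainer();
        deserializeInto(reused, first);
        deserializeInto(reused, second);

        System.out.println(reused.names); // prints [Led Zeppelin]
    }
}
```

Removing the two clear() calls makes the second read leave both Cream and Led Zeppelin in the lists, which resembles the cumulative values described in the original post.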