I investigated further and found odd behavior in the readString() method of
BinaryDecoder.java in the Avro source code when the reduce() method contains
the statement record.put("rowkey", key).  Does this mean that there is a bug in
BinaryDecoder.java?  Thanks.
Ey-Chih Chow 

From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: is this a bug?
Date: Fri, 4 Mar 2011 00:48:55 -0800

What follows are fragments of the trace logs of our MR jobs, first with and
then without the statement 'record.put("rowkey", key)' mentioned in the
previous messages.  The difference shows in the last line of each log, which is
logged at the entry of the reduce() method.  For the first fragment, that line
is 'working on 0000000200000000000000000000000000002 whose rowKey is
0000000300000000000000000000000000003'; for the second fragment, it is
'working on 0000000200000000000000000000000000002 whose rowKey is
0000000200000000000000000000000000002'.  The second is what we expected: it
corresponds to the correct key/values pair passed to the reduce() method.  Note
that both fragments were generated by adding extra log statements to the
Hadoop and Avro source code.

Can anybody help determine whether this is a bug in the Avro or Hadoop code?
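
The comparison described above can be automated.  The sketch below is a
minimal Java example, assuming the reducer's log message always has the form
'working on <digits> whose rowKey is <digits>' as in the fragments in this
message.  The class and method names are made up for illustration.  The pattern
tolerates the line wrapping introduced by the mail archive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or Avro.
final class ReducerLogCheck {
    // \s+ also matches line breaks, so wrapped log entries are still found.
    private static final Pattern ENTRY =
        Pattern.compile("working on\\s+(\\d+)\\s+whose rowKey is\\s+(\\d+)");

    // Returns one "left != right" string per entry whose two values differ.
    static List<String> mismatches(String log) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTRY.matcher(log);
        while (m.find()) {
            if (!m.group(1).equals(m.group(2))) {
                out.add(m.group(1) + " != " + m.group(2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String log = "NgActivityGatheringReducer: working on \n"
            + "0000000200000000000000000000000000002 whose rowKey is \n"
            + "0000000300000000000000000000000000003";
        System.out.println(mismatches(log));
    }
}
```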

==============================================================================================================

Log fragment with the statement 'record.put("rowkey", key)'

2011-03-03 18:00:00,180 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
isSkipping():false
2011-03-03 18:00:00,190 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 18:00:00,198 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 18:00:00,199 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000000000000000000000000000000000
2011-03-03 18:00:00,199 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 18:00:00,199 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000000000000000000000000000000000
2011-03-03 18:00:00,199 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
done with set values
2011-03-03 18:00:00,199 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000000000000000000000000000000000 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 18:00:00,199 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000000000000000000000000000000000
2011-03-03 18:00:00,199 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 18:00:00,208 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 18:00:00,208 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000000000000000000000000000000000", 
"tableName": null, "Games__": [{"columnName": "0_TESTFAM_TESTSKU_1.5", 
"columnValue": {"bytes": "ame": "hwty", "columnValue": "stringvalue"}, 
{"columnName": "loc", "columnValue": "stringvalue"}, {"columnName": "osrev", 
"columnValue": "stringvalue"}, {"columnName": "tz", "columnValue": 
"stringvalue"}], "PlayerState__": [{"columnName": 
"0_TESTFAM_TESTSKU_1.0=GC=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456789}, 
{"columnName": "0_TESTFAM_TESTSKU_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456799}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 18:00:00,208 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000000000000000000000000000000000", "tableName": 
null,"Games__": [{"columnName": "0_TESTFAM_TESTSKU_1.5", "columnValue": 
{"bytes": "ame": "hwty", "columnValue": "stringvalue"}, {"columnName": "loc", 
"columnValue": "stringvalue"}, {"columnName": "osrev", "columnValue": 
"stringvalue"}, {"columnName": "tz", "columnValue": "stringvalue"}], 
"PlayerState__": [{"columnName":"0_TESTFAM_TESTSKU_1.0=GC=2010:01:01:07", 
"columnValue": "{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 
123456789}, {"columnName":"0_TESTFAM_TESTSKU_1.0=GS=2010:01:01:07", 
"columnValue": "{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 
123456799}], "ClientSessions__": null, "ServerSessions__": null, 
"Monetization__": null}
2011-03-03 18:00:00,208 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 18:00:00,208 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 18:00:00,208 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 18:00:00,209 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000100000000000000000000000000001
2011-03-03 18:00:00,209 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 18:00:00,209 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000100000000000000000000000000001
2011-03-03 18:00:00,210 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000000000000000000000000000000000 whose rowKey is 
0000000000000000000000000000000000000
2011-03-03 18:00:00,215 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
call nextKey()
2011-03-03 18:00:00,215 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000100000000000000000000000000001 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 18:00:00,215 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000100000000000000000000000000001
2011-03-03 18:00:00,216 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 18:00:00,216 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 18:00:00,216 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000100000000000000000000000000001", 
"tableName":null, "Gam, "Metadata__": null, "PlayerState__": [{"columnName": 
"0_TESTFAM2_TESTSKU2_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}","timestamp": 123456789}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 18:00:00,216 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000100000000000000000000000000001", "tableName": null, 
"Games__": [{"colu: null, "PlayerState__": [{"columnName": 
"0_TESTFAM2_TESTSKU2_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456789}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 18:00:00,216 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 18:00:00,216 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 18:00:00,216 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 18:00:00,216 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000200000000000000000000000000002
2011-03-03 18:00:00,216 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 18:00:00,216 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000200000000000000000000000000002
2011-03-03 18:00:00,216 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000100000000000000000000000000001 whose rowKey is 
0000000200000000000000000000000000002
2011-03-03 18:00:00,217 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
call nextKey()
2011-03-03 18:00:00,217 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000200000000000000000000000000002 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 18:00:00,217 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000200000000000000000000000000002
2011-03-03 18:00:00,217 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 18:00:00,217 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 18:00:00,218 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000200000000000000000000000000002", 
"tableName": null, "Gam, "Metadata__": null, "PlayerState__": [{"columnName": 
"0_TESTFAM3_TESTSKU3_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456899}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 18:00:00,218 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000200000000000000000000000000002", "tableName": null, 
"Games__": [{"colu: null, "PlayerState__": [{"columnName": 
"0_TESTFAM3_TESTSKU3_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456899}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 18:00:00,218 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 18:00:00,218 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 18:00:00,218 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 18:00:00,218 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000300000000000000000000000000003
2011-03-03 18:00:00,218 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 18:00:00,218 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000300000000000000000000000000003
2011-03-03 18:00:00,218 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000200000000000000000000000000002 whose rowKey is 
0000000300000000000000000000000000003
   
===========================================================================================================

Log fragment without the statement 'record.put("rowkey", key)'

2011-03-03 21:02:05,077 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
isSkipping():false
2011-03-03 21:02:05,092 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 21:02:05,102 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 21:02:05,102 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000000000000000000000000000000000
2011-03-03 21:02:05,102 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 21:02:05,102 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000000000000000000000000000000000
2011-03-03 21:02:05,102 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
done with set values
2011-03-03 21:02:05,102 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000000000000000000000000000000000 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 21:02:05,103 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000000000000000000000000000000000
2011-03-03 21:02:05,103 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 21:02:05,113 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 21:02:05,114 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000000000000000000000000000000000", 
"tableName": null, "Games__": [{"columnName": "0_TESTFAM_TESTSKU_1.5", 
"columnValue": {"bytes": "ame": "hwty", "columnValue": "stringvalue"}, 
{"columnName": "loc", "columnValue": "stringvalue"}, {"columnName": "osrev", 
"columnValue": "stringvalue"}, {"columnName": "tz", "columnValue": 
"stringvalue"}], "PlayerState__": [{"columnName": 
"0_TESTFAM_TESTSKU_1.0=GC=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456789}, 
{"columnName": "0_TESTFAM_TESTSKU_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456799}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 21:02:05,114 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000000000000000000000000000000000", "tableName": null, 
"Games__": [{"columnName": "0_TESTFAM_TESTSKU_1.5", "columnValue": {"bytes": 
"ame": "hwty", "columnValue": "stringvalue"}, {"columnName": "loc", 
"columnValue": "stringvalue"}, {"columnName": "osrev", "columnValue": 
"stringvalue"}, {"columnName": "tz", "columnValue": "stringvalue"}], 
"PlayerState__": [{"columnName": "0_TESTFAM_TESTSKU_1.0=GC=2010:01:01:07", 
"columnValue": "{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 
123456789}, {"columnName": "0_TESTFAM_TESTSKU_1.0=GS=2010:01:01:07", 
"columnValue": "{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 
123456799}], "ClientSessions__": null, "ServerSessions__": null, 
"Monetization__": null}
2011-03-03 21:02:05,114 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 21:02:05,114 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 21:02:05,114 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 21:02:05,114 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000100000000000000000000000000001
2011-03-03 21:02:05,114 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 21:02:05,114 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000100000000000000000000000000001
2011-03-03 21:02:05,115 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000000000000000000000000000000000 whose rowKey is 
0000000000000000000000000000000000000
2011-03-03 21:02:05,121 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
call nextKey()
2011-03-03 21:02:05,121 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000100000000000000000000000000001 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 21:02:05,121 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000100000000000000000000000000001
2011-03-03 21:02:05,121 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 21:02:05,121 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 21:02:05,121 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000100000000000000000000000000001", 
"tableName": null, "Gam, "Metadata__": null, "PlayerState__": [{"columnName": 
"0_TESTFAM2_TESTSKU2_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456789}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 21:02:05,121 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000100000000000000000000000000001", "tableName": null, 
"Games__": [{"colu: null, "PlayerState__": [{"columnName": 
"0_TESTFAM2_TESTSKU2_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456789}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 21:02:05,122 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 21:02:05,122 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 21:02:05,122 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 21:02:05,122 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000200000000000000000000000000002
2011-03-03 21:02:05,122 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 21:02:05,122 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000200000000000000000000000000002
2011-03-03 21:02:05,122 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000100000000000000000000000000001 whose rowKey is 
0000000100000000000000000000000000001
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.ReduceTask: trace bug 
call nextKey()
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.ReduceTask: trace bug key 
is 0000000200000000000000000000000000002 values is 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator@1deeb40
2011-03-03 21:02:05,123 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: work on key 
0000000200000000000000000000000000002
2011-03-03 21:02:05,123 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@26e9f9
2011-03-03 21:02:05,123 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected RECORD
2011-03-03 21:02:05,123 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum {"rowKey": "0000000200000000000000000000000000002", 
"tableName": null, "Gam, "Metadata__": null, "PlayerState__": [{"columnName": 
"0_TESTFAM3_TESTSKU3_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456899}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
value is {"rowKey": "0000000200000000000000000000000000002", "tableName": null, 
"Games__": [{"colu: null, "PlayerState__": [{"columnName": 
"0_TESTFAM3_TESTSKU3_1.0=GS=2010:01:01:07", "columnValue": 
"{"mojo":10,"afloat":1.99,"hat":"red"}", "timestamp": 123456899}], 
"ClientSessions__": null, "ServerSessions__": null, "Monetization__": null}
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.Merger: trace bug adjust 
priority queue
2011-03-03 21:02:05,123 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialize() reader org.apache.avro.specific.SpecificDatumReader@1a001ff
2011-03-03 21:02:05,123 INFO org.apache.avro.generic.GenericDatumReader: trace 
bug type of expected STRING
2011-03-03 21:02:05,123 INFO org.apache.avro.mapred.AvroSerialization: trace 
bug deserialized datum 0000000300000000000000000000000000003
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
deserializer is 
org.apache.avro.mapred.AvroSerialization$AvroWrapperDeserializer@1abcc03
2011-03-03 21:02:05,123 INFO org.apache.hadoop.mapred.TaskRunner: trace bug1 
key is 0000000300000000000000000000000000003
2011-03-03 21:02:05,124 INFO 
com.ngmoco.ngpipes.sourcing.NgActivityGatheringReducer: working on 
0000000200000000000000000000000000002 whose rowKey is 
0000000200000000000000000000000000002

From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: RE: is this a bug?
Date: Wed, 2 Mar 2011 16:12:20 -0800

Sorry, I found that my previous message shows up all in black in the archive.
Let me re-explain the problem.  The following AvroReducer code causes the
problem:

    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        GenericRecord record = null;
        for (GenericRecord value : values) {
            // -- code omitted here --
            record = value;
            record.put("rowkey", key);   // <=== this statement causes the problem
            collector.collect(record);
        }
    }

As explained in my previous message, if I remove the statement
record.put("rowkey", key), the code works fine: the key/values pairs passed to
reduce() are correct.  With the statement in place, the key/values pairs passed
to reduce() are mismatched, something like (key1, values1), (key2, values3)
rather than (key2, values2).  Some details are explained in my previous
message.  Is this problem related to the Hadoop binary iterators or to the Avro
deserialization code?  Thanks.
Ey-Chih Chow
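
One possible explanation, not confirmed anywhere in this thread, is object
reuse: if the framework hands reduce() the same mutable key object on every
call, and the deserializer overwrites a record's existing string field in place
when it refills that record, then storing the key itself in the record lets a
later read change the key.  The following Java sketch illustrates only that
aliasing effect.  MutableText, readString(), readRecord() and
keyAfterNextRead() are hypothetical stand-ins written for this illustration;
they are not Avro or Hadoop APIs, and the real code paths may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a reusable, mutable string object.
final class MutableText {
    private String chars = "";
    MutableText set(String s) { chars = s; return this; }
    @Override public String toString() { return chars; }
}

final class ReuseAliasingSketch {
    // Mimics a decoder that writes into the offered object instead of
    // allocating a new one (assumed behavior, for illustration only).
    static MutableText readString(Object old, String incoming) {
        MutableText target =
            (old instanceof MutableText) ? (MutableText) old : new MutableText();
        return target.set(incoming);
    }

    // Mimics refilling a reused record: the old field value is offered for reuse.
    static void readRecord(Map<String, Object> record, String rowKey) {
        record.put("rowkey", readString(record.get("rowkey"), rowKey));
    }

    // Returns what the key object holds after the next value has been read
    // into the same record.
    static String keyAfterNextRead(boolean copyKey) {
        MutableText key = new MutableText().set("key2");
        Map<String, Object> record = new HashMap<>();
        readRecord(record, "key2");                  // value for key2 arrives
        // Body of the reducer: store the key, either shared or copied.
        record.put("rowkey",
            copyKey ? new MutableText().set(key.toString()) : key);
        readRecord(record, "key3");                  // next value reuses the record
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(keyAfterNextRead(false)); // prints "key3"
        System.out.println(keyAfterNextRead(true));  // prints "key2"
    }
}
```

Under that assumption, storing a copy of the key rather than the key object,
for example its toString() value, would keep the record from sharing the key.
Whether that applies to the job above is untested here.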
From: eyc...@hotmail.com
To: user@avro.apache.org
Subject: is this a bug?
Date: Wed, 2 Mar 2011 13:05:55 -0800

Hi,
I am working on an Avro MR job and encountering an issue with AvroReducer<Utf8, 
GenericRecord, GenericRecord>. The corresponding reduce() routine is 
implemented in the following way:

    public void reduce(Utf8 key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
        ...
        GenericRecord record = null;
        for (GenericRecord value : values) {
            ...
            record = value;
            record.put("rowkey", key);
            ...
            collector.collect(record);
        }
    }

If I comment out the statement record.put("rowkey", key) (marked in red in the
original message), the reduce function gets called properly, with CORRECT
key/values pairs passed to reduce().  However, if I add that statement to the
routine, the reduce function is called with WRONG key/values pairs, in the
sense that key2 is paired with values3, instead of values2, when passed to the
reduce() routine.  I traced this problem by including Hadoop source code, such
as ReduceTask.java and Task.java, and Avro source code, such as
HadoopReducer.java, HadoopReducerBase.java, and all the serialization code.
The problem showed up on the second call of reduce(), but I cannot locate the
exact place that causes it.  My intuition is that it arises either in the
Hadoop iterators after the merge sort or in Avro deserialization.  Can anybody
help me with this?  Thanks.
Ey-Chih Chow                                                              
