Yong Zhang created AVRO-1953:
--------------------------------

Summary: ArrayIndexOutOfBoundsException in org.apache.avro.io.parsing.Symbol$Alternative.getSymbol
Key: AVRO-1953
URL: https://issues.apache.org/jira/browse/AVRO-1953
Project: Avro
Issue Type: Bug
Affects Versions: 1.7.4
Reporter: Yong Zhang

We are facing an issue where our Avro MapReduce job cannot process the Avro data in the reducer.

Here is the schema of our data:

{
  "namespace" : "our package name",
  "type" : "record",
  "name" : "Lists",
  "fields" : [
    {"name" : "account_id", "type" : "long"},
    {"name" : "list_id", "type" : "string"},
    {"name" : "sequence_id", "type" : ["int", "null"]},
    {"name" : "name", "type" : ["string", "null"]},
    {"name" : "state", "type" : ["string", "null"]},
    {"name" : "description", "type" : ["string", "null"]},
    {"name" : "dynamic_filtered_list", "type" : ["int", "null"]},
    {"name" : "filter_criteria", "type" : ["string", "null"]},
    {"name" : "created_at", "type" : ["long", "null"]},
    {"name" : "updated_at", "type" : ["long", "null"]},
    {"name" : "deleted_at", "type" : ["long", "null"]},
    {"name" : "favorite", "type" : ["int", "null"]},
    {"name" : "delta", "type" : ["boolean", "null"]},
    {
      "name" : "list_memberships",
      "type" : {
        "type" : "array",
        "items" : {
          "name" : "ListMembership",
          "type" : "record",
          "fields" : [
            {"name" : "channel_id", "type" : "string"},
            {"name" : "created_at", "type" : ["long", "null"]},
            {"name" : "created_source", "type" : ["string", "null"]},
            {"name" : "deleted_at", "type" : ["long", "null"]},
            {"name" : "sequence_id", "type" : ["int", "null"]}
          ]
        }
      }
    }
  ]
}

Our MapReduce job takes the delta of the above dataset and uses our merge logic to merge the latest changes into the dataset. The whole MR job runs daily and has worked fine for 18 months.

During this time, the merge MapReduce job failed twice with the following error. The failure happens in the reducer stage, which means the Avro data was read successfully in the map stage and sent to the reducers, where we sort the data by key and timestamp so that the delta can be merged on the reducer side:

java.lang.ArrayIndexOutOfBoundsException
	at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:364)
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
	at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:108)
	at org.apache.avro.hadoop.io.AvroDeserializer.deserialize(AvroDeserializer.java:48)
	at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:142)
	at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:117)
	at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:297)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:165)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:652)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
	at java.security.AccessController.doPrivileged(AccessController.java:366)
	at javax.security.auth.Subject.doAs(Subject.java:572)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1502)
	at org.apache.hadoop.mapred.Child.main(Child.java:249)

The MapReduce job eventually fails in the reducer stage.

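For readers following the trace: the top frames (ResolvingDecoder.readIndex, then Symbol$Alternative.getSymbol) are where the decoder reads a union branch index from the stream and looks up the matching branch. Below is a minimal, stdlib-only Java sketch of that step, assuming the index is stored as a zigzag varint as described in the Avro binary encoding. It is not Avro's code, and the names UnionIndexSketch, readZigZagInt, branchUnchecked and branchChecked are hypothetical. It only illustrates how an index that matches no branch of a two-way union such as ["int", "null"] ends in an ArrayIndexOutOfBoundsException. It does not establish why such an index would appear in this job.

```java
// Hypothetical illustration only; this is not Avro's source code.
class UnionIndexSketch {

    // Branches of a two-way union such as ["int", "null"] from the schema.
    static final String[] BRANCHES = {"int", "null"};

    // Decode one zigzag-encoded variable-length int from the start of buf.
    // Assumption: union branch indices use this encoding on the wire.
    static int readZigZagInt(byte[] buf) {
        int raw = 0;
        for (int i = 0; i < buf.length && i < 5; i++) {
            raw |= (buf[i] & 0x7F) << (7 * i);
            if ((buf[i] & 0x80) == 0) {
                // Undo zigzag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, 4 -> 2, ...
                return (raw >>> 1) ^ -(raw & 1);
            }
        }
        throw new IllegalArgumentException("truncated or over-long varint");
    }

    // Unchecked lookup: a plain array access, so a bad index surfaces as
    // ArrayIndexOutOfBoundsException with no further context.
    static String branchUnchecked(int index) {
        return BRANCHES[index];
    }

    // Checked lookup: same result for valid indices, but a bad index is
    // reported together with the number of branches that were expected.
    static String branchChecked(int index) {
        if (index < 0 || index >= BRANCHES.length) {
            throw new IllegalStateException(
                "union index " + index + " out of range for "
                    + BRANCHES.length + " branches");
        }
        return BRANCHES[index];
    }

    public static void main(String[] args) {
        // 0x02 is the zigzag encoding of 1, which selects the "null" branch.
        int ok = readZigZagInt(new byte[] {0x02});
        System.out.println("index " + ok + " -> " + branchUnchecked(ok));

        // 0x04 is the zigzag encoding of 2: no such branch in a two-way union.
        int bad = readZigZagInt(new byte[] {0x04});
        try {
            branchUnchecked(bad);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed for index " + bad);
        }
    }
}
```

In this sketch the decoder has no way to tell a wrong index from a right one until the lookup, so any bytes that are misaligned with the schema the reader expects would fail at this point rather than earlier.
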
I don't think our data is corrupted, as it is read fine in the map stage. Every time we get this error, we have to pull the whole huge dataset from the source again, rebuild the Avro files, and restart the daily merge. Then, after several months, we hit the same issue again, for a reason we don't know yet.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)