[ https://issues.apache.org/jira/browse/AVRO-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Douglas Creager updated AVRO-986:
---------------------------------

    Attachment: quickstop.db

I've attached a copy of the quickstop.db Avro file; this is generated by one of the C test cases. It contains the avro.sync metadata field. I'm happy to add this to the share directory also, but unfortunately I don't know enough about the Java build scripts to write a test case for Doug's patch.

> Avro files generated from avro-c dont work with the Java mapred implementation.
> -------------------------------------------------------------------------------
>
>                 Key: AVRO-986
>                 URL: https://issues.apache.org/jira/browse/AVRO-986
>             Project: Avro
>          Issue Type: Bug
>          Components: c, java
>         Environment: avro-c 1.6.2-SNAPSHOT
>                      avro-java 1.6.2-SNAPSHOT
>                      hadoop 0.20.2
>            Reporter: Michael Cooper
>            Priority: Critical
>              Labels: c, hadoop, java, mapreduce
>         Attachments: 0001-Remove-sync-marker-from-metadata-in-header.patch, AVRO-986-java.patch, quickstop.db
>
>
> When a file generated by the Avro-C implementation is fed into Hadoop, it fails with "Block size invalid or too large for this implementation: -49".
> This is caused by the sync marker, namely the copy that Avro-C puts into the header metadata.
> The org.apache.avro.mapred.AvroRecordReader uses a FileSplit object to work out where it should read from, but this class is not particularly smart: it just divides the file into equal-size chunks, the first starting at position 0.
> So org.apache.avro.mapred.AvroRecordReader gets 0 as the start of its chunk, and calls
> {code:title=AvroRecordReader.java}
> reader.sync(split.getStart());                    // sync to start
> {code}
> Then org.apache.avro.file.DataFileReader::seek() goes to 0 and searches for a sync marker.
> It encounters one at position 32: the one in the header metadata map, under the key "avro.sync".
> No other implementation adds the sync marker to the metadata map, and none reads it from there, not even the C version.
> I suggest we remove this from the header as the simplest solution.
> Another solution would be to create an AvroFileSplit class in mapred that knows where the blocks are, and provides the correct locations in the first place.
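
For readers who want to see the failure mode in isolation, here is a minimal sketch of the scan described in the report. It does not use the Avro library and is not Avro's actual code: the class name SyncScanSketch, the marker bytes, and the simplified byte layout (no varint lengths, no real metadata map encoding) are all assumptions made for illustration. It only shows that a scan for the 16-byte sync marker starting at offset 0 stops early when the header metadata also contains a copy of that marker.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a simplified stand-in for an Avro container file header.
final class SyncScanSketch {
    // Hypothetical 16-byte sync marker; real writers use random bytes.
    static final byte[] MARKER = new byte[16];
    // Stand-in for the first encoded data block that follows the header.
    static final byte[] BLOCK = {2, 6, 'a', 'b', 'c'};

    static {
        for (int i = 0; i < MARKER.length; i++) {
            MARKER[i] = (byte) (0xA0 + i);
        }
    }

    static void write(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length);
    }

    // Builds magic + metadata + header sync marker + block.
    // With markerInMetadata=true, the metadata also carries an "avro.sync"
    // entry holding a copy of the marker, as the report says avro-c did.
    static byte[] buildFile(boolean markerInMetadata) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, new byte[] {'O', 'b', 'j', 1});
        if (markerInMetadata) {
            write(out, "avro.sync".getBytes(StandardCharsets.UTF_8));
            write(out, MARKER);
        }
        write(out, "avro.schema".getBytes(StandardCharsets.UTF_8));
        write(out, "\"string\"".getBytes(StandardCharsets.UTF_8));
        write(out, MARKER); // the marker that really ends the header
        write(out, BLOCK);
        return out.toByteArray();
    }

    // Scans forward from start and returns the offset just past the first
    // occurrence of the marker, or -1 if none is found.
    static int syncFrom(byte[] file, int start) {
        for (int i = start; i + MARKER.length <= file.length; i++) {
            byte[] window = Arrays.copyOfRange(file, i, i + MARKER.length);
            if (Arrays.equals(window, MARKER)) {
                return i + MARKER.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (boolean inMeta : new boolean[] {false, true}) {
            byte[] file = buildFile(inMeta);
            int headerEnd = file.length - BLOCK.length;
            int resumed = syncFrom(file, 0);
            System.out.println("markerInMetadata=" + inMeta
                    + " resumedAt=" + resumed + " headerEnd=" + headerEnd);
        }
    }
}
```

In the second case the scan resumes inside the metadata rather than at the first block, so whatever bytes follow would be decoded as a block count and size, which is consistent with the "Block size invalid" error quoted above.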