I'm starting to use Avro as the data storage format for my project. Using Avro in a RecordReader raises a number of issues.
In an Avro-formatted file the metadata is at the end of the file and must be read to get the schema, codec, etc. If every RecordReader does this when processing a file split, it will be an expensive operation, because the RecordReader would have to access data that is not likely to be local to the node. As an optimization, the input formatter can open the file and read the metadata when determining the file splits. The metadata can then be put in the InputSplit that is passed to the Avro RecordReader.

In processing a file split, a RecordReader must sync to the start of a block and read past the end of the file split until it reaches the end of the current block. The RecordReader therefore needs to know when it has reached a block boundary.

It is possible for an error to occur when processing an Avro record. A RecordReader should never just stop on an error and throw an exception. When an error occurs, the Avro RecordReader should sync to the next block and continue reading.

The RecordReader also needs to be able to get the current position in the file to determine when it has reached the end of the file split.

Some refactoring of DataFileReader.java would allow the class to be used by an Avro RecordReader. The following enhancements are being proposed:

- Add a constructor that allows the metadata to be passed as a parameter. The new constructor will not seek to the end of the file and read the metadata.
- Add a getBlockCount() method that returns the blockCount. This will allow a RecordReader to determine when to stop reading by checking whether the blockCount is zero after reading past the end of the file split.
- Add a syncReset() method that will sync to the next block marker and reset the blockCount. The RecordReader uses this method to move to the next block when an error occurs and then continue reading.
- Add a tell() method that returns the current position in the file. The RecordReader uses this to determine when it has read past the end of a file split.
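To make the intended use of getBlockCount(), syncReset() and tell() concrete, here is a rough sketch of the read loop a RecordReader might run over one split. Everything in it is an assumption for illustration: BlockReader is a hypothetical interface standing in for the refactored DataFileReader, FakeReader is an in-memory stand-in where each record counts as one unit of file position, and the sketch assumes the reader has already been synced to the first block of the split. It is not the actual Avro API.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitReadSketch {

    /** Hypothetical view of the reader; the last three methods mirror the proposal. */
    interface BlockReader<T> {
        boolean hasNext();     // loads the next block when the current one is exhausted
        T next();              // may throw on a corrupt record
        long getBlockCount();  // records left in the current block
        void syncReset();      // skip to the next block marker and reset blockCount
        long tell();           // current position in the file
    }

    /** Reads records until the block that straddles splitEnd has been finished. */
    static <T> List<T> readSplit(BlockReader<T> reader, long splitEnd) {
        List<T> records = new ArrayList<>();
        while (true) {
            // Stop only at a block boundary that lies at or past the end of the split.
            if (reader.getBlockCount() == 0 && reader.tell() >= splitEnd) break;
            if (!reader.hasNext()) break; // end of file
            try {
                records.add(reader.next());
            } catch (RuntimeException e) {
                // Do not abort on a bad record: drop the rest of the block and go on.
                reader.syncReset();
            }
        }
        return records;
    }

    /** In-memory stand-in: a null entry simulates a corrupt record. */
    static final class FakeReader implements BlockReader<String> {
        private final String[][] blocks;
        private int block = -1;      // index of the block being read
        private int rec = 0;         // index of the next record within that block
        private long blockCount = 0; // records left in the current block
        private long pos = 0;        // one position unit per record (assumption)

        FakeReader(String[][] blocks) { this.blocks = blocks; }

        public boolean hasNext() {
            while (blockCount == 0) {
                if (block + 1 >= blocks.length) return false;
                block++;
                rec = 0;
                blockCount = blocks[block].length;
            }
            return true;
        }

        public String next() {
            String r = blocks[block][rec++];
            blockCount--;
            pos++;
            if (r == null) throw new IllegalStateException("corrupt record");
            return r;
        }

        public long getBlockCount() { return blockCount; }

        public void syncReset() {
            pos += blockCount; // jump over whatever is left in this block
            blockCount = 0;
        }

        public long tell() { return pos; }
    }

    public static void main(String[] args) {
        String[][] blocks = { {"a", "b"}, {"c", "d", "e"}, {"f"} };
        // The split ends inside the second block, so that block is read to its end.
        System.out.println(readSplit(new FakeReader(blocks), 3)); // prints [a, b, c, d, e]
    }
}
```

The stop condition is checked before hasNext() so that the reader does not load a block belonging to the next split. After an error, syncReset() leaves blockCount at zero, so the same condition also ends the loop if the skipped block was the last one for this split.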
If I make these changes, is there interest in having them contributed back to Avro?

Thanks,
Bob

Robert Goodman
IBM Big Sheets