I'm starting to use Avro as the data storage format for my project. A
number of issues arise when using Avro in a RecordReader.

    - In an Avro formatted file the metadata is at the end of the file and
      must be read to get the schema, codec, etc. If every RecordReader
      does this when processing a file split, it will be an expensive
      operation, because the RecordReader would have to access data which
      is not likely to be local to the node. As an optimization, the
      InputFormat can open the file and read the metadata once, when
      determining the file splits. The metadata can then be put in the
      InputSplit which is passed to the Avro RecordReader.
    - In processing a file split, a RecordReader must sync to the start of
      a block and read past the end of the file split until it reaches the
      end of the current block. The RecordReader therefore needs to know
      when it has reached a block boundary.
    - An error can occur when processing an Avro record. A RecordReader
      should never just stop on an error and throw an exception. When an
      error occurs, the Avro RecordReader should sync to the next block
      and continue reading.
    - The RecordReader needs to be able to get the current position in the
      file to determine when it has reached the end of the file split.
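
One way to picture the block-boundary rule above: a block belongs to the
split in which it starts. A reader that syncs forward from its split start
and finishes the block that straddles its split end then reads every block
exactly once. The sketch below only illustrates that rule on plain offsets.
The names SplitBlocks and blocksForSplit, and the offsets used with them,
are made up for this illustration and are not part of Avro or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBlocks {
    /**
     * Returns the start offsets of the blocks owned by the split
     * [splitStart, splitEnd). A block is owned by the split that contains
     * its first byte, even if the block's data runs past splitEnd.
     */
    static List<Long> blocksForSplit(long[] blockStarts, long splitStart, long splitEnd) {
        List<Long> owned = new ArrayList<>();
        for (long blockStart : blockStarts) {
            if (blockStart >= splitStart && blockStart < splitEnd) {
                owned.add(blockStart);
            }
        }
        return owned;
    }
}
```

With blocks starting at offsets 0, 100, 250 and 400, the split [0, 200)
owns the blocks at 0 and 100 (reading the second one means reading past
offset 200), and the split [200, 500) owns the blocks at 250 and 400.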

Some refactoring of DataFileReader.java would allow the class to be used
by an Avro RecordReader. I propose the following enhancements.

    - Add a constructor which allows the metadata to be passed as a
      parameter. The new constructor will not seek to the end of the file
      to read the metadata.
    - Add a getBlockCount() method which returns the blockCount. This will
      allow a RecordReader to determine when to stop reading by checking
      whether the blockCount is zero after reading past the end of the
      file split.
    - Add a syncReset() method which will sync to the next block marker and
      reset the blockCount. The RecordReader uses this method to move to
      the next block when an error occurs and continue reading.
    - Add a tell() method which will return the current position in the
      file. The RecordReader uses this to determine when it has read past
      the end of a file split.
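
To make the intended use of getBlockCount(), syncReset() and tell()
concrete, here is a rough sketch of the read loop a RecordReader might run.
It is written against a small hypothetical interface (BlockReader), not
against the real DataFileReader, and FakeReader is an in-memory stand-in in
which every record is assumed to occupy 10 bytes and a null entry marks a
corrupt record. None of these names exist in Avro.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitReadSketch {

    /** Hypothetical reader view; the last three methods are the proposed additions. */
    interface BlockReader {
        boolean hasNext();
        String next();
        long getBlockCount(); // records left in the current block, 0 at a block boundary
        void syncReset();     // skip to the next block marker and reset the block count
        long tell();          // current position in the file
    }

    /** Reads from the reader's current position until the split is finished. */
    static List<String> readSplit(BlockReader in, long splitEnd) {
        List<String> records = new ArrayList<>();
        while (in.hasNext()) {
            // Stop only when past the split end AND at a block boundary.
            if (in.tell() >= splitEnd && in.getBlockCount() == 0) {
                break;
            }
            try {
                records.add(in.next());
            } catch (RuntimeException e) {
                // Do not abort: drop the rest of this block and carry on.
                in.syncReset();
            }
        }
        return records;
    }

    /** In-memory stand-in: fixed 10-byte records, null marks a corrupt record. */
    static final class FakeReader implements BlockReader {
        private final String[][] blocks;  // records of each block
        private final long[] blockStarts; // start offset of each block
        private int block = 0;            // index of the current block
        private int index = 0;            // next record within the current block

        FakeReader(String[][] blocks, long[] blockStarts) {
            this.blocks = blocks;
            this.blockStarts = blockStarts;
        }

        public boolean hasNext() { return block < blocks.length; }

        public String next() {
            String record = blocks[block][index];
            if (record == null) throw new IllegalStateException("corrupt record");
            index++;
            if (index == blocks[block].length) { block++; index = 0; }
            return record;
        }

        public long getBlockCount() { return index == 0 ? 0 : blocks[block].length - index; }

        public void syncReset() { block++; index = 0; }

        // Only valid while hasNext() is true, which readSplit checks first.
        public long tell() { return blockStarts[block] + 10L * index; }
    }
}
```

For blocks {a, b}, {c, corrupt, d} and {e} at offsets 0, 20 and 50 and a
split end of 25, readSplit returns [a, b, c]. The corrupt record makes the
loop skip to the block at offset 50, which starts past the split end, so
reading stops there and d is lost along with the rest of its block.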


If I make these changes, is there interest in having them contributed
back to Avro?

   Thanks
    Bob


Robert Goodman
 IBM Big Sheets
