Robert Goodman wrote:
     In an Avro-formatted file the metadata is at the end of the file and
     must be read to get the schema, codec, etc. If this is done by every
     RecordReader when processing a file split, it will be an expensive
     operation. The RecordReader would have to access data which would
     likely not be local to the node. As an optimization, the InputFormat
     can open the file and read the metadata when determining the file
     splits. The metadata can then be put in the InputSplit which is passed
     to the Avro RecordReader.

     In processing a file split, a RecordReader must sync to the start of
     a block and read past the end of the file split until it reaches the
     end of the current block. The RecordReader needs to know when it has
     reached a block boundary.

Note that Hadoop's SequenceFile currently reads a small amount of metadata from the start of files, but this does not seem to affect performance much.

You are right to note that Avro's data file currently requires that metadata also be read from the end of the file. But there is currently a proposal to change Avro's data file format:

  https://issues.apache.org/jira/browse/AVRO-160

In short, the plan is to put metadata at the front of each block in the file. MapReduce applications would still need to read the sync marker from the head of the file before processing a split, but would no longer need to read metadata from the end of the file as well.
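
To make that concrete, here is a rough sketch of how a RecordReader might scan forward from the start of its split to the first sync marker, once the marker can be read from the head of the file. This is a sketch only: the scan loop below is illustrative rather than Avro's current API, and parsing the marker out of the file header is left out.

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;

  public class SyncScan {
    // Avro data files separate blocks with a 16-byte sync marker.
    private static final int SYNC_SIZE = 16;

    /**
     * Scan forward from the start of a split for the next sync marker and
     * return the position just past it, or -1 if none is found before end.
     * The marker itself is assumed to have been read from the file header.
     */
    static long nextSync(FSDataInputStream in, byte[] sync, long start, long end)
        throws IOException {
      in.seek(start);
      long pos = start;
      int matched = 0;
      while (pos < end) {
        int b = in.read();
        if (b < 0) {
          return -1;                      // hit end of file first
        }
        pos++;
        if (b == (sync[matched] & 0xff)) {
          matched++;
          if (matched == SYNC_SIZE) {
            return pos;                   // first block of the split starts here
          }
        } else {
          matched = (b == (sync[0] & 0xff)) ? 1 : 0;
        }
      }
      return -1;                          // no block starts within this split
    }
  }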

Your proposal to read sync markers and/or metadata when constructing splits has the downside that it could serialize something that's otherwise done in parallel. For example, let's assume your job has 1000 input files, each with 10 splits, on a 100-node cluster. With your proposal you'd need to open and read the headers of 1000 files in the client at job submit time. Having map tasks read these instead would result in 10,000 reads of metadata, but they'd happen in parallel, 100 or more at a time, and all but the first for each file would probably not require a seek. The job client could be written to do these reads in parallel using a thread pool, but I doubt there would be much net job speedup, since the amount of metadata is small and fits in a packet or two.
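
For what it's worth, the client-side parallel version might look something like the sketch below, using an ordinary thread pool; the fixed header size and the idea of reading a fixed-size prefix are assumptions made only for illustration. As noted, I doubt it buys much.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HeaderPrefetch {
    // Assume the interesting metadata fits in the first kilobyte or so.
    private static final int HEADER_SIZE = 1024;

    /** Read the first HEADER_SIZE bytes of each input file concurrently. */
    static List<byte[]> readHeaders(final FileSystem fs, List<Path> files,
                                    int threads) throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      List<Future<byte[]>> pending = new ArrayList<Future<byte[]>>();
      for (final Path file : files) {
        pending.add(pool.submit(new Callable<byte[]>() {
          public byte[] call() throws IOException {
            byte[] header = new byte[HEADER_SIZE];
            FSDataInputStream in = fs.open(file);
            try {
              in.readFully(0, header);    // positioned read of the file head
            } finally {
              in.close();
            }
            return header;
          }
        }));
      }
      List<byte[]> headers = new ArrayList<byte[]>();
      for (Future<byte[]> f : pending) {
        headers.add(f.get());             // propagates any read failure
      }
      pool.shutdown();
      return headers;
    }
  }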

     It is possible for an error to occur when processing an Avro record.
     A RecordReader should never just stop on an error and throw an
     exception. When an error occurs the Avro RecordReader should sync to
     the next block and continue reading.

I agree that ignoring errors should be possible, but it isn't always best.

     The RecordReader needs to be able to get the current position in the
     file to determine when it has reached the end of the file split.

Yes, DataFileReader should probably have a tell() method.

Some refactoring of DataFileReader.java would allow the class to be used
by an Avro RecordReader. The following enhancements are proposed.

     Add a constructor which allows the metadata to be passed as a
     parameter. The proposed constructor would not need to seek to the end
     of the file and read the metadata.

I think AVRO-160 will mostly obviate the need for this.

     Add a getBlockCount() method which returns the blockCount. This will
     allow a RecordReader to determine when to stop reading by checking
     whether the blockCount is zero after reading past the end of the file
     split.

Won't a tell() method be sufficient for this? With block-based I/O and compression, tell() will generally return the position of the beginning of the current block, i.e., it will not be incremented except when block boundaries are crossed.
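
For example, the read loop in a RecordReader could look roughly like this, assuming the proposed tell() alongside the reader's hasNext()/next(); the exact boundary test is illustrative and has to match how blocks are assigned to splits:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.avro.file.DataFileReader;

  public class SplitReadLoop {
    /**
     * Read all records from the blocks owned by one split. The reader is
     * assumed to have already synced to the first block at or after the
     * split's start offset.
     */
    static <D> List<D> readSplit(DataFileReader<D> reader, long splitEnd)
        throws IOException {
      List<D> records = new ArrayList<D>();
      while (reader.hasNext()) {
        // tell() is taken to report the start of the current block, so a
        // block that begins before splitEnd is read to completion even if
        // it runs past the split boundary.
        if (reader.tell() > splitEnd) {
          break;
        }
        records.add(reader.next());
      }
      return records;
    }
  }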

     Add a syncReset() method which will sync to the next block marker and
     reset the blockCount. This method would be used by the RecordReader to
     move to the next block when an error occurs and continue reading.

Wouldn't sync(tell()) implement this?
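
That is, something like the following, again assuming the proposed tell(); how the reader signals a corrupt record is an assumption here:

  import java.io.IOException;

  import org.apache.avro.file.DataFileReader;

  public class SkipOnError {
    /**
     * Read the next record, skipping ahead to the next block if the current
     * one turns out to be corrupt. Returns null when a block was skipped.
     */
    static <D> D nextOrSkipBlock(DataFileReader<D> reader) throws IOException {
      try {
        return reader.next();
      } catch (RuntimeException corrupt) {
        // sync(pos) scans forward from pos for the next sync marker, which
        // is the behaviour the proposed syncReset() describes
        reader.sync(reader.tell());
        return null;
      }
    }
  }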

     Add a tell() method which will return the current position in the
     file. This would be used by the RecordReader to determine when it has
     read past the end of a file split.

+1 This is certainly needed.

If I make these changes, is there interest in having them contributed
back to Avro?

Please file a Jira issue for each change you intend to implement, and discussion can continue there.

Cheers,

Doug
