Robert Goodman wrote:
> In an Avro formatted file the metadata is at the end of the file and
> must be read to get the schema, codec, etc. If this is done by every
> RecordReader when processing a file split, it will be an expensive
> operation: the RecordReader would have to access data which is not
> likely to be local to the node. As an optimization, the InputFormat
> can open the file and read the metadata when determining the file
> splits. The metadata can then be put in the InputSplit which is
> passed to the Avro RecordReader.
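For concreteness, the split you describe might look roughly like the
sketch below. This is hypothetical code against the old mapred API;
AvroInputSplit, schemaJson and codecName are made-up names, not
existing Avro or Hadoop classes.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileSplit;

  // Hypothetical split that carries the file's metadata, read once by
  // the InputFormat, so a RecordReader need not seek to the file's tail.
  public class AvroInputSplit extends FileSplit {
    private String schemaJson;  // schema from the file metadata
    private String codecName;   // compression codec from the file metadata

    public AvroInputSplit() {}  // required for Writable deserialization

    public AvroInputSplit(Path file, long start, long length,
                          String[] hosts, String schemaJson,
                          String codecName) {
      super(file, start, length, hosts);
      this.schemaJson = schemaJson;
      this.codecName = codecName;
    }

    public String getSchemaJson() { return schemaJson; }
    public String getCodecName() { return codecName; }

    @Override
    public void write(DataOutput out) throws IOException {
      super.write(out);
      Text.writeString(out, schemaJson);
      Text.writeString(out, codecName);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      super.readFields(in);
      schemaJson = Text.readString(in);
      codecName = Text.readString(in);
    }
  }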
> In processing a file split, a RecordReader must sync to the start of
> a block and read past the end of the file split until it reaches the
> end of the current block. The RecordReader needs to know when it has
> reached a block boundary.
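For reference, the contract you describe is roughly the following
(an untested sketch; the reader interface uses the tell() and sync()
methods proposed further down in this message, not Avro's current API):

  import java.io.IOException;

  public class SplitReadSketch {
    // Stand-in for a DataFileReader with the proposed methods; the
    // names are illustrative only.
    interface BlockReader<D> {
      void sync(long pos) throws IOException; // skip to next sync marker after pos
      long tell() throws IOException;         // offset of the current block's start
      boolean hasNext() throws IOException;
      D next() throws IOException;
    }

    static <D> void readSplit(BlockReader<D> reader, long start,
                              long length) throws IOException {
      long end = start + length;
      // Records before the first sync marker in the split belong to
      // the previous split, so first skip ahead to that boundary.
      reader.sync(start);
      // Keep reading while the current block still starts inside the
      // split; the final block may extend past the split's end, which
      // is fine, since the next split will sync past it.
      while (reader.tell() < end && reader.hasNext()) {
        D record = reader.next();
        // ... hand the record to the map task ...
      }
    }
  }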
Note that Hadoop's SequenceFile currently reads a small amount of
metadata from the start of files, but this does not seem to affect
performance much.
You are right to note that Avro's data file currently requires metadata
to also be read from the end of the file. But there's currently a
proposal to change Avro's data file format:

https://issues.apache.org/jira/browse/AVRO-160

In short, the plan is to put metadata at the front of each block in the
file. MapReduce applications would still need to read the sync marker
from the head of the file before processing a split, but would no longer
need to also read metadata from the end of the file.
Your proposal to read sync markers and/or metadata when constructing
splits has the downside that it could serialize something that's
otherwise done in parallel. For example, let's assume your job has
1000 input files, each with 10 splits, on a 100 node cluster. With your
proposal you'd need to open and read the headers of 1000 files in the
client at job submit time. Having map tasks read these instead would
result in 10,000 reads of metadata, but they'd happen in parallel, 100
or more at a time, and all but the first for each file would probably
not require a seek. The job client could be written to do these in
parallel using a thread pool, but I doubt there would be much net job
speedup, since the amount of metadata is small and fits in a packet or two.
> It is possible for an error to occur when processing an Avro record.
> A RecordReader should never just stop on an error and throw an
> exception. When an error occurs, the Avro RecordReader should sync to
> the next block and continue reading.
I agree that ignoring errors should be possible, but it isn't always best.
> The RecordReader needs to be able to get the current position in the
> file to determine when it has reached the end of the file split.
Yes, DataFileReader should probably have a tell() method.
> Some re-factoring of DataFileReader.java would allow the class to be
> used by an Avro RecordReader. The following enhancements are being
> proposed.
> Add a constructor which allows the metadata to be passed as a
> parameter. The new constructor would not need to seek to the end of
> the file and read the metadata.
I think AVRO-160 will mostly obviate the need for this.
> Add a getBlockCount() method which returns the blockCount. This will
> allow a RecordReader to determine when to stop reading by checking
> whether the blockCount is zero after reading past the end of the
> file split.
Won't a tell() method be sufficient for this? With block-based I/O and
compression, tell() will generally return the position of the beginning
of the current block, i.e., it will not be incremented except when block
boundaries are crossed.
> Add a syncReset() method which will sync to the next block marker and
> reset the blockCount. This method would be used by the record reader
> to move to the next block and continue reading when an error occurs.
Wouldn't sync(tell()) implement this?
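I.e., something like this variant of the readSplit() sketch above
(again hypothetical; it assumes sync(p) finds the next sync marker
strictly after p, so syncing from the start of a corrupt block skips
the remainder of that block):

  // Recover from a decode error by skipping to the next block, which
  // is what the proposed syncReset() would do.
  static <D> void readSplitWithRecovery(BlockReader<D> reader,
                                        long start, long length)
      throws IOException {
    long end = start + length;
    reader.sync(start);
    while (reader.tell() < end && reader.hasNext()) {
      try {
        D record = reader.next();
        // ... hand the record to the map task ...
      } catch (IOException e) {
        // tell() points at the start of the block containing the bad
        // record; syncing from there moves to the next block boundary.
        reader.sync(reader.tell());
      }
    }
  }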
> Add a tell() method which will return the current position in the
> file. This is used by the record reader to determine when it has read
> past the end of a file split.
+1 This is certainly needed.
> If I make these changes, is there interest in having these changes
> contributed back to Avro?
Please file a Jira issue for each that you intend to implement and
discussion can continue there.
Cheers,
Doug