Robert,
Sounds good. I hope to implement AVRO-160 before the year's end. I may
well integrate a tell() method into that, we'll see. But please file
the Jira issue so that we don't lose track of that need.
Cheers,
Doug
Robert Goodman wrote:
Doug,
Thanks for your response. I will open a Jira for adding a tell()
method, with a description of how it would be used by a RecordReader,
to start the discussion. With Christmas vacation I won't be able to
provide a patch until the beginning of January.
Bob
Robert Goodman
IBM Big Sheets
Doug Cutting <cutt...@apache.org> wrote on 12/08/2009 04:15:53 PM:
Subject: Re: Enhancing DataFileReader.java to support a Hadoop Import
Formatter and RecordReader
Robert Goodman wrote:
In an Avro formatted file the metadata is at the end of the file and
must be read to get the schema, codec, etc. If this is done by every
RecordReader when processing a file split, it will be an expensive
operation. The RecordReader would have to access data which would
not likely be local to the node. As an optimization, the Input
Formatter can open the file and read the metadata when determining
the file splits. The metadata information can be put in the
InputSplit which is passed to the Avro RecordReader.
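For concreteness, a rough sketch of what such a split might carry is
below. The class and field names are hypothetical, and a real version
would also need to extend Hadoop's FileSplit (or otherwise implement
the InputSplit/Writable plumbing) so it can be shipped to the tasks:

  import org.apache.avro.Schema;

  // Hypothetical split that carries the file metadata read once when the
  // splits are computed, so each RecordReader does not have to seek to
  // the end of the file to fetch it.  Illustrative only.
  public class AvroFileSplit {
    private final String path;        // file this split covers
    private final long start;         // byte offset where the split begins
    private final long length;        // length of the split in bytes
    private final Schema schema;      // writer's schema from the file metadata
    private final String codec;       // compression codec from the file metadata
    private final byte[] syncMarker;  // sync marker from the file metadata

    public AvroFileSplit(String path, long start, long length,
                         Schema schema, String codec, byte[] syncMarker) {
      this.path = path;
      this.start = start;
      this.length = length;
      this.schema = schema;
      this.codec = codec;
      this.syncMarker = syncMarker;
    }

    public String getPath()       { return path; }
    public long   getStart()      { return start; }
    public long   getLength()     { return length; }
    public Schema getSchema()     { return schema; }
    public String getCodec()      { return codec; }
    public byte[] getSyncMarker() { return syncMarker; }
  }

The Input Formatter would fill in one of these per split after reading
each file's metadata once.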
In processing a file split, a RecordReader must sync to the start of
a block and read past the end of the file split until it reaches the
end of the current block. The record reader needs to know when it
has reached a block boundary.
Note that Hadoop's SequenceFile currently reads a small amount of
metadata from the start of files, but this does not seem to affect
performance much.
You are right to note that Avro's data file currently requires
metadata to also be read from the end of the file. But there's a
proposal to change Avro's data file format:
https://issues.apache.org/jira/browse/AVRO-160
In short, the plan is to put metadata at the front of each block in the
file. Mapreduce applications would still need to read the sync marker
from the head of the file before processing a split, but would no longer
need to also read metadata from the end of the file.
Your proposal to read sync markers and/or metadata when constructing
splits has the downside that it could serialize something that's
otherwise done in parallel. For example, let's assume your job has
1000 input files, each with 10 splits, on a 100-node cluster. With your
proposal you'd need to open and read the headers of 1000 files in the
client at job submit time. Having map tasks read these instead would
result in 10,000 reads of metadata, but they'd happen in parallel, 100
or more at a time, and all but the first for each file would probably
not require a seek. The job client could be written to do these in
parallel using a thread pool, but I doubt there would be much net job
speedup, since the amount of metadata is small and fits in a packet or
two.
It is possible for an error to occur when processing an Avro record.
A RecordReader should never just stop on an error and throw an
exception. When an error occurs, the Avro RecordReader should sync
to the next block and continue reading.
I agree that ignoring errors should be possible, but it isn't always
best.
The RecordReader needs to be able to get the current position in the
file to determine when it has reached the end of the file split.
Yes, DataFileReader should probably have a tell() method.
Some refactoring of DataFileReader.java would allow the class to be
used by an Avro RecordReader. The following enhancements are being
proposed:
Add a constructor which allows the metadata to be passed as a
parameter. The new proposed constructor will not sync to the end of
the file and read the metadata.
I think AVRO-160 will mostly obviate the need for this.
Add a getBlockCount() method which returns the blockCount. This will
allow a RecordReader to determine when to stop reading by checking
whether the blockCount is zero after reading past the end of the
file split.
Won't a tell() method be sufficient for this? With block-based I/O
and compression, tell() will generally return the position of the
beginning of the current block, i.e., it will not be incremented
except when block boundaries are crossed.
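A minimal sketch of how a record reader might use a tell() with those
semantics as its stopping test is below. Note that sync(), tell(),
hasNext() and next() here are proposed or stand-in calls, not methods
DataFileReader is guaranteed to have today, and splitStart/splitEnd
come from the InputSplit:

  // Sketch only: read every block whose starting position falls before
  // splitEnd; the block that straddles splitEnd is finished before
  // stopping, and blocks that start at or after splitEnd are left to
  // the reader of the next split.
  void readSplit(DataFileReader<GenericRecord> reader,
                 long splitStart, long splitEnd) throws IOException {
    reader.sync(splitStart);       // skip to the first block at or after splitStart
    while (reader.hasNext()) {
      // tell() reports the start of the current block, so once it reaches
      // splitEnd everything that follows belongs to the next split.
      if (reader.tell() >= splitEnd) {
        break;
      }
      GenericRecord record = reader.next();
      // ... hand the record to the map task ...
    }
  }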
Add a syncReset() method which will sync to the next block marker
and reset the blockCount. This method is used by the record reader
to move to the next block and continue reading when an error occurs.
Wouldn't sync(tell()) implement this?
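For illustration, that would look something like the following in the
record reader's read loop (same caveats as above: tell() and sync()
are the proposed methods, and the exception type caught is only
illustrative):

  // Sketch only: skip a bad block by re-syncing rather than failing the
  // whole task.
  while (reader.hasNext() && reader.tell() < splitEnd) {
    try {
      GenericRecord record = reader.next();
      // ... emit the record ...
    } catch (Exception e) {   // concrete exception type depends on the reader
      // Jump past the current (corrupt) block to the next sync marker
      // and keep reading.
      reader.sync(reader.tell());
    }
  }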
Add a tell() method which will return the current position in the
file. This is used by the record reader to determine when it has
read past the end of a file split.
+1 This is certainly needed.
If I make these changes, is there interest in having them
contributed back to Avro?
Please file a Jira issue for each change you intend to implement,
and discussion can continue there.
Cheers,
Doug