Robert,
Sounds good. I hope to implement AVRO-160 before the year's end. I may
well integrate a tell() method into that, we'll see. But please file
the Jira issue so that we don't lose track of that need.
Cheers,
Doug
Robert Goodman wrote:
Doug,
Thanks for your response. I will open a Jira for adding a tell()
method, with a description of how it would be used by a RecordReader,
to start the discussion. With Christmas vacation I won't be able to
provide a patch until the beginning of January.
Bob
Robert Goodman
IBM Big Sheets
Doug Cutting <cutt...@apache.org> wrote on 12/08/2009 04:15:53 PM:
Subject: Re: Enhancing DataFileReader.java to support a Hadoop Import
Formatter and RecordReader
Robert Goodman wrote:
In an Avro formatted file the metadata is at the end of the file and
must be read to get the schema, codec, etc. If this is done by every
RecordReader when processing a file split, it will be an expensive
operation. The RecordReader would have to access data which would
not likely be local to the node. As an optimization, the Input
Formatter can open the file and read the metadata when determining
the file splits. The metadata information can be put in the
InputSplit which is passed to the Avro RecordReader.
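For concreteness, a rough sketch of what such a split might carry is
below. The class and field names are hypothetical, and a real version
would also need to extend Hadoop's FileSplit (or otherwise implement
the InputSplit/Writable plumbing) so it can be shipped to the tasks:

  import org.apache.avro.Schema;

  // Hypothetical split that carries the file metadata read once when the
  // splits are computed, so each RecordReader does not have to seek to
  // the end of the file to fetch it.  Illustrative only.
  public class AvroFileSplit {
    private final String path;        // file this split covers
    private final long start;         // byte offset where the split begins
    private final long length;        // length of the split in bytes
    private final Schema schema;      // writer's schema from the file metadata
    private final String codec;       // compression codec from the file metadata
    private final byte[] syncMarker;  // sync marker from the file metadata

    public AvroFileSplit(String path, long start, long length,
                         Schema schema, String codec, byte[] syncMarker) {
      this.path = path;
      this.start = start;
      this.length = length;
      this.schema = schema;
      this.codec = codec;
      this.syncMarker = syncMarker;
    }

    public String getPath()       { return path; }
    public long   getStart()      { return start; }
    public long   getLength()     { return length; }
    public Schema getSchema()     { return schema; }
    public String getCodec()      { return codec; }
    public byte[] getSyncMarker() { return syncMarker; }
  }

The Input Formatter would fill in one of these per split after reading
each file's metadata once.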
In processing a file split, a RecordReader must sync to the start of
a block and read past the end of the file split until it reaches the
end of the current block. The record reader needs to know when it
has reached a block boundary.
Note that Hadoop's SequenceFile currently reads a small amount of
metadata from the start of files, but this does not seem to affect
performance much.
You are right to note that Avro's data file currently requires
metadata to also be read from the end of the file. But there's a
proposal to change Avro's data file format:
https://issues.apache.org/jira/browse/AVRO-160
In short, the plan is to put metadata at the front of each block in the
file. Mapreduce applications would still need to read the sync marker
from the head of the file before processing a split, but would no longer
need to also read metadata from the end of the file.
Your proposal to read sync markers and/or metadata when constructing
splits has the downside that it could serialize something that's
otherwise done in parallel. For example, let's assume your job has
1000 input files, each with 10 splits, on a 100-node cluster. With your
proposal you'd need to open and read the headers of 1000 files in the
client at job submit time. Having map tasks read these instead would
result in 10,000 reads of metadata, but they'd happen in parallel, 100
or more at a time, and all but the first for each file would probably
not require a seek. The job client could be written to do these in
parallel using a thread pool, but I doubt there would be much net job
speedup, since the amount of metadata is small and fits in a packet or
two.
It is possible for an error to occur when processing an Avro record.
A RecordReader should never just stop on an error and throw an
exception. When an error occurs, the Avro RecordReader should sync
to the next block and continue reading.
I agree that ignoring errors should be possible, but it isn't always
best.
The RecordReader needs to be able to get the current position in the
file to determine when it has reached the end of the file split.
Yes, DataFileReader should probably have a tell() method.
Some refactoring of DataFileReader.java would allow the class to be
used by an Avro RecordReader. The following enhancements are being
proposed:
Add a constructor which allows the metadata to be passed as a
parameter. The new proposed constructor will not sync to the end of
the file and read the metadata.
I think AVRO-160 will mostly obviate the need for this.
Add a getBlockCount() method which returns the blockCount. This will
allow a RecordReader to determine when to stop reading by checking
whether the blockCount is zero after reading past the end of the
file split.
Won't a tell() method be sufficient for this? With block-based I/O
and compression, tell() will generally return the position of the
beginning of the current block, i.e., it will not be incremented
except when block boundaries are crossed.
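A minimal sketch of how a record reader might use a tell() with those
semantics as its stopping test is below. Note that sync(), tell(),
hasNext() and next() here are proposed or stand-in calls, not methods
DataFileReader is guaranteed to have today, and splitStart/splitEnd
come from the InputSplit:

  // Sketch only: read every block whose starting position falls before
  // splitEnd; the block that straddles splitEnd is finished before
  // stopping, and blocks that start at or after splitEnd are left to
  // the reader of the next split.
  void readSplit(DataFileReader<GenericRecord> reader,
                 long splitStart, long splitEnd) throws IOException {
    reader.sync(splitStart);       // skip to the first block at or after splitStart
    while (reader.hasNext()) {
      // tell() reports the start of the current block, so once it reaches
      // splitEnd everything that follows belongs to the next split.
      if (reader.tell() >= splitEnd) {
        break;
      }
      GenericRecord record = reader.next();
      // ... hand the record to the map task ...
    }
  }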
Add a syncReset() method which will sync to the next block marker
and reset the blockCount. This method is used by the record reader
to move to the next block and continue reading when an error occurs.
Wouldn't sync(tell()) implement this?
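For illustration, that would look something like the following in the
record reader's read loop (same caveats as above: tell() and sync()
are the proposed methods, and the exception type caught is only
illustrative):

  // Sketch only: skip a bad block by re-syncing rather than failing the
  // whole task.
  while (reader.hasNext() && reader.tell() < splitEnd) {
    try {
      GenericRecord record = reader.next();
      // ... emit the record ...
    } catch (Exception e) {   // concrete exception type depends on the reader
      // Jump past the current (corrupt) block to the next sync marker
      // and keep reading.
      reader.sync(reader.tell());
    }
  }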
Add a tell() method which will return the current position in the
file. This is used by the record reader to determine when it has
read past the end of a file split.
+1 This is certainly needed.
If I make these changes, is there interest in having them
contributed back to Avro?
Please file a Jira issue for each change you intend to implement,
and discussion can continue there.
Cheers,
Doug