Re: Enhancing DataFileReader.java to support a Hadoop Import Formatter and RecordReader

Robert Goodman Thu, 10 Dec 2009 16:01:52 -0800

Doug,

Thanks for your response, I will open a Jira for adding a tell() method
with a description of how it would be used by a Record Reader to start the
discussion. With Christmas vacation I won't be able to provide a patch
until the beginning of January.


   Bob


Robert Goodman
 IBM Big Sheets

Doug Cutting <cutt...@apache.org> wrote on 12/08/2009 04:15:53 PM:

> Doug Cutting <cutt...@apache.org>
> 12/08/2009 04:15 PM
>
> Please respond to
> avro-dev@hadoop.apache.org
>
> To
>
> avro-dev@hadoop.apache.org
>
> cc
>
> Subject
>
> Re: Enhancing DataFileReader.java to support a Hadoop Import
> Formatter and RecordReader
>
> Robert Goodman wrote:
> >      In a Avro formatted file the metadata is at the end of the file
and
> >      must be read to get the schema, codec, etc. If this is done by
every
> >      RecordReader when processing a file split, it will be an expensive
> >      operation. The RecordReader would have to access data which not
> >      likely be local to the node. As an optimization the Input
Formatter
> >      can open the file and read the metadata when determining the file
> >      splits. The metadata information can be put in the InputSplit
which
> >      is passed to the Avro RecordReader.
> >
> >      In processing a file split a Record Reader must sync to the start
of
> >      a block and read past the end of the file split until it reaches
the
> >      end of the current block. The record reader needs to know when it
has
> >      reached a block boundary.
>
> Note that Hadoop's SequenceFile currently reads a small amount of
> metadata from the start of files but that this does not seem to affect
> performance much.
>
> You are right to note that Avro's data file currently requires metadata
> to be also read from the end of the file.  But there's currently a
> proposal to change Avro's data file format:
>
>    https://issues.apache.org/jira/browse/AVRO-160
>
> In short, the plan is to put metadata at the front of each block in the
> file.  Mapreduce applications would still need to read the sync marker
> from the head of the file before processing a split, but would no longer
> need to also read metadata from the end of the file.
>
> Your proposal to read sync markers and/or metadata when constructing
> splits has the downside that it could serialize something that's
> otherwise done in parallel.  For example, let's assume your jobs has
> 1000 input files each with 10 splits on a 100 node cluster.  With your
> proposal you'd need to open and read the headers of 1000 files in the
> client at job submit time.  Having map tasks read these instead would
> result in 10,000 reads of metadata, but they'd happen in parallel, 100
> or more at a time, and all but the first for each file would probably
> not require a seek.  The job client could be written to do these in
> parallel using a thread pool, but I doubt there would be much net job
> speedup, since the amount of metadata is small and fits in a packet or
two.
>
> >      It is possible for an error to occur when processing an Avro
record.
> >      A RecordReader should never just stop on an error and throw an
> >      exception. When an error occurs the Avro RecordReader should sync
to
> >      the next block and continue reading.
>
> I agree that ignoring errors should be possible, but it isn't always
best.
>
> >      The RecordReader needs to be able to get the current position in
the
> >      file to determine when it has reached the end of file split.
>
> Yes, DataFileReader() should probably have a tell() method.
>
> > Some re-factoring of DataFileReader.java would allow the class to be
used
> > by an Avro RecordReader. The following enhancements are being proposed.
> >
> >      Add a constructor which allows the metadata to be passed as a
> >      parameter. The new proposed constructor will not sync to the end
of
> >      the file and read the metadata.
>
> I think AVRO-160 will mostly obviate the need for this.
>
> >      Add a getBlockCount() method which returns the blockCount. This
will
> >      allow the a RecordReader to determine when to stop reading by
> >      checking if the blockCount is zero after reading pass the end of
the
> >      file split.
>
> Won't a tell() method be sufficient for this?  With block-based i/o and
> compression, tell() will generally return the position of the beginning
> of the current block, i.e., will not be incremented except when block
> boundaries are crossed.
>
> >      Add a syncReset() method which will sync to the next block marker
and
> >      reset the blockCount. This method is used by the record reader to
> >      move to the next block when an error occurs and continue reading.
>
> Wouldn't sync(tell()) implement this?
>
> >      Add a tell() method which will return the current position in the
> >      file. This is used by the record reader to determine when it has
read
> >      past the end of a file split.
>
> +1 This is certainly needed.
>
> > If I make these changes, is there interest in have these changes
> > contributed back to Avro.
>
> Please file a Jira issue for each that you intend to implement and
> discussion can continue there.
>
> Cheers,
>
> Doug
>

Re: Enhancing DataFileReader.java to support a Hadoop Import Formatter and RecordReader

Reply via email to