On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas <cdoug...@apache.org> wrote:

> On Sat, May 31, 2014 at 10:53 PM, Niels Basjes <ni...@basjes.nl> wrote:
> > The Hadoop framework uses the filename extension  to automatically insert
> > the "right" decompression codec in the read pipeline.
>
> This would be the new behavior, incompatible with existing code.
>

You are right, I was wrong. It is the LineRecordReader that inserts it.

Looking at this code and where it is used, I noticed that the bug I'm trying
to prevent is present in the current trunk.
The NLineInputFormat does not override isSplitable and uses the
LineRecordReader, which is capable of reading gzipped input. The overall
effect is that this input format silently produces garbage (missing lines +
duplicated lines) when run against a gzipped file. I just verified this.
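
For reference, the LineRecordReader picks its decompression codec purely from
the file name (via CompressionCodecFactory), and TextInputFormat already
guards against splitting such files with a codec-aware isSplitable override.
Something along these lines is exactly what NLineInputFormat is missing (a
sketch from memory, not a verbatim copy of trunk; imports from
org.apache.hadoop.io.compress and org.apache.hadoop.mapreduce assumed):

    // Codec-aware guard, roughly what TextInputFormat does and what
    // NLineInputFormat lacks (sketch from memory, not verbatim trunk code).
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      CompressionCodec codec =
          new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
      if (codec == null) {
        return true;  // no codec registered for this suffix: plain file, safe to split
      }
      // Only codecs that can start reading at a split boundary (e.g. bzip2)
      // implement SplittableCompressionCodec; gzip does not.
      return codec instanceof SplittableCompressionCodec;
    }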

> > So if someone does what you describe then they would need to unload all
> > compression codecs or face decompression errors. And if it really was
> > gzipped then it would not be splittable at all.
>
> Assume an InputFormat configured for a job assumes that isSplitable
> returns true because it extends FileInputFormat. After the change, it
> could spuriously return false based on the suffix of the input files.
> In the prenominate example, SequenceFile is splittable, even if the
> codec used in each block is not. -C
>

And if you then give that file the .gz extension, this breaks all common
sense / conventions about file names.


Let's reiterate the options I see now:
1) isSplitable --> return true
    Too unsafe; I say "must change". I alone have hit my head on this twice
so far, many others have too, and even the current trunk still has this bug
in it.

2) isSplitable --> return false
    Safe, but too slow in some cases. In those cases the actual
implementation can simply override it and regain the original performance
(see the sketch after this list).

3) isSplitable --> return true (same as the current implementation) unless
the file has an extension that is associated with a non-splittable
compression codec (e.g. .gz or something like that).
    If a custom format wants to break with the well-known conventions about
file names then it should simply override isSplitable with its own.

4) isSplitable --> abstract
    Compatibility breaker. I see this as the cleanest way to force the
developer of a custom FileInputFormat to think about their specific case.
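
To make the "simply override it" part of options 2 and 3 concrete, here is a
purely hypothetical custom format (the class name is made up for
illustration) whose author knows the data splits safely no matter what the
file name suggests, and who therefore opts back in with a one-line override:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Hypothetical example: a format whose files split safely even when they
    // carry a misleading suffix. Under options 2 and 3 its author regains
    // splitting (and the original performance) with a one-line override.
    public class MyContainerInputFormat
        extends FileInputFormat<LongWritable, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return true;  // the format author explicitly takes responsibility
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new LineRecordReader();  // whatever reader the format really needs
      }
    }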

I hold "correct data" much higher than performance and scalability; so the
performance impact is a concern but it is much less important than the list
of bugs we are facing right now.

The safest way would be either 2 or 4. Solution 3 would effectively be the
same as the current implementation, yet it would catch the problem
situations as long as people stick to normal file name conventions.
Solution 3 would also allow removing some code duplication in several
subclasses.

I would go for solution 3.

Niels Basjes
