The Hadoop framework uses the filename extension to automatically insert
the "right" decompression codec into the read pipeline.
So anyone who does what you describe would need to unload all
compression codecs or face decompression errors. And if the file really were
gzipped, it would not be splittable at all.
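
To make the suffix argument concrete, here is a minimal sketch in plain
Java. It does not use the real Hadoop classes: the class name
SuffixCodecSketch, the method names and the two-entry suffix table are all
hypothetical, chosen only to illustrate a lookup that is driven purely by
the filename, under the assumption that gzip is not splittable and bzip2 is.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a suffix-based codec lookup; NOT the Hadoop API.
class SuffixCodecSketch {

    // Illustrative table: filename suffix -> can the codec be split?
    // (Assumption for this sketch: gzip cannot, bzip2 can.)
    private static final Map<String, Boolean> SPLITTABLE_BY_SUFFIX = new LinkedHashMap<>();
    static {
        SPLITTABLE_BY_SUFFIX.put(".gz", false);
        SPLITTABLE_BY_SUFFIX.put(".bz2", true);
    }

    // Picks a codec by looking only at the filename, never at the file contents.
    static Optional<String> codecSuffixFor(String fileName) {
        for (String suffix : SPLITTABLE_BY_SUFFIX.keySet()) {
            if (fileName.endsWith(suffix)) {
                return Optional.of(suffix);
            }
        }
        return Optional.empty();
    }

    // No known suffix: treat as uncompressed and splittable.
    // Known suffix: splittable only if the codec is.
    static boolean isSplitable(String fileName) {
        return codecSuffixFor(fileName).map(SPLITTABLE_BY_SUFFIX::get).orElse(true);
    }

    public static void main(String[] args) {
        String[] names = {"part-00000.txt", "logs.gz", "logs.bz2", "foo.gz"};
        for (String name : names) {
            System.out.println(name + " -> splitable=" + isSplitable(name));
        }
    }
}
```

In this sketch a file named foo.gz is treated as gzipped whatever it
actually contains, which is the situation under discussion: the name alone
decides both the codec and the split decision.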

Niels
On May 31, 2014 11:12 PM, "Chris Douglas" <cdoug...@apache.org> wrote:

> On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <ni...@basjes.nl> wrote:
> > How would someone create the situation you are referring to?
>
> By adopting a naming convention where the filename suffix doesn't
> imply that the raw data are compressed with that codec.
>
> For example, if a user named SequenceFiles foo.lzo and foo.gz to
> record which codec was used, then isSplitable would spuriously return
> false. -C
>
> > On May 31, 2014 1:06 AM, "Doug Cutting" <cutt...@apache.org> wrote:
> >
> >> I was trying to explain my comment, where I stated that, "changing the
> >> default implementation to return false would be an incompatible
> >> change".  The patch was added 6 months after that comment, so the
> >> comment didn't address the patch.
> >>
> >> The patch does not appear to change the default implementation to
> >> return false unless the suffix of the file name is that of a known
> >> unsplittable compression format.  So the folks who'd be harmed by this
> >> are those who used a suffix like ".gz" for an Avro, Parquet or
> >> other-format file.  Their applications might suddenly run much slower
> >> and it would be difficult for them to determine why.  Such folks are
> >> probably few, but perhaps exist.  I'd prefer a change that avoided
> >> that possibility entirely.
> >>
> >> Doug
> >>
> >> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <ni...@basjes.nl> wrote:
> >> > Hi,
> >> >
> >> > The way I see the effects of the original patch on existing subclasses:
> >> > - implemented isSplitable
> >> >    --> no performance difference.
> >> > - did not implement isSplitable
> >> >    --> then there is no performance difference if the container is either
> >> >        not compressed or uses a splittable compression.
> >> >    --> If it uses a common non splittable compression (like gzip) then the
> >> >        output will suddenly be different (which is the correct answer) and
> >> >        the jobs will finish sooner because the input is not processed
> >> >        multiple times.
> >> >
> >> > Where do you see a performance impact?
> >> >
> >> > Niels
> >> > On May 30, 2014 8:06 PM, "Doug Cutting" <cutt...@apache.org> wrote:
> >> >
> >> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <ni...@basjes.nl> wrote:
> >> >> > For reasons I still do not fully understand, this was rejected by
> >> >> > Todd and Doug.
> >> >>
> >> >> Performance is a part of compatibility.
> >> >>
> >> >> Doug
> >> >>
> >>
>
