On Fri, May 30, 2014 at 11:05 PM, Niels Basjes <ni...@basjes.nl> wrote:
> How would someone create the situation you are referring to?

By adopting a naming convention where the filename suffix doesn't
imply that the raw data are compressed with that codec. For example,
if a user named SequenceFiles foo.lzo and foo.gz to record which codec
was used, then isSplittable would spuriously return false.

-C

> On May 31, 2014 1:06 AM, "Doug Cutting" <cutt...@apache.org> wrote:
>
>> I was trying to explain my comment, where I stated that, "changing the
>> default implementation to return false would be an incompatible
>> change". The patch was added 6 months after that comment, so the
>> comment didn't address the patch.
>>
>> The patch does not appear to change the default implementation to
>> return false unless the suffix of the file name is that of a known
>> unsplittable compression format. So the folks who'd be harmed by this
>> are those who used a suffix like ".gz" for an Avro, Parquet or
>> other-format file. Their applications might suddenly run much slower
>> and it would be difficult for them to determine why. Such folks are
>> probably few, but perhaps exist. I'd prefer a change that avoided
>> that possibility entirely.
>>
>> Doug
>>
>> On Fri, May 30, 2014 at 3:02 PM, Niels Basjes <ni...@basjes.nl> wrote:
>> > Hi,
>> >
>> > The way I see the effects of the original patch on existing subclasses:
>> > - implemented isSplitable
>> >   --> no performance difference.
>> > - did not implement isSplitable
>> >   --> then there is no performance difference if the container is either
>> >       not compressed or uses a splittable compression.
>> >   --> If it uses a common non splittable compression (like gzip) then the
>> >       output will suddenly be different (which is the correct answer) and
>> >       the jobs will finish sooner because the input is not processed
>> >       multiple times.
>> >
>> > Where do you see a performance impact?
>> >
>> > Niels
>> > On May 30, 2014 8:06 PM, "Doug Cutting" <cutt...@apache.org> wrote:
>> >
>> >> On Thu, May 29, 2014 at 2:47 AM, Niels Basjes <ni...@basjes.nl> wrote:
>> >> > For arguments I still do not fully understand this was rejected by
>> >> > Todd and Doug.
>> >>
>> >> Performance is a part of compatibility.
>> >>
>> >> Doug
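
The concern raised in this thread can be illustrated with a small, self-contained model. This is only a sketch in Python, not Hadoop's Java API: the function names, the suffix set (taken from the foo.lzo / foo.gz example above), and the file and split sizes are all assumptions made for illustration. It shows how a check keyed purely on the filename suffix would report a container file named foo.gz as unsplittable, collapsing the job to a single split.

```python
# Hypothetical model of a suffix-based splittability check.
# None of these names come from Hadoop; they exist only for this sketch.

# Assumed set of suffixes treated as "known unsplittable codecs",
# mirroring the foo.lzo / foo.gz example in the thread.
UNSPLITTABLE_SUFFIXES = (".gz", ".lzo")


def is_splitable_by_suffix(filename: str) -> bool:
    """Return False when the name ends with an unsplittable codec's suffix.

    The check looks only at the name, never at the file's real format.
    """
    return not filename.endswith(UNSPLITTABLE_SUFFIXES)


def estimated_splits(file_size: int, split_size: int, splitable: bool) -> int:
    """Estimate how many input splits a single file yields."""
    if not splitable:
        return 1  # the whole file is handed to one task
    return max(1, -(-file_size // split_size))  # ceiling division


MIB = 1024 * 1024
file_size = 1024 * MIB   # assumed 1 GiB container file
split_size = 128 * MIB   # assumed split size

# A container file whose name merely records the codec used inside it.
for name in ("foo.seq", "foo.gz"):
    splitable = is_splitable_by_suffix(name)
    print(name, splitable, estimated_splits(file_size, split_size, splitable))
```

Under these assumed sizes, the same 1 GiB container file goes from 8 splits to 1 when its name happens to end in .gz, which is the silent slowdown Doug describes. An implementation that already overrides isSplitable itself would not be affected, as Niels notes.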