Change proposal for FileInputFormat isSplitable

2014-05-28 Thread Niels Basjes
Hi, Last week I ran into this problem again https://issues.apache.org/jira/browse/MAPREDUCE-2094 What happens here is that the default implementation of the isSplitable method in FileInputFormat is so unsafe that just about everyone who implements a new subclass is likely to get this wrong. The e

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Steve Loughran
On 28 May 2014 20:50, Niels Basjes wrote: > Hi, > > Last week I ran into this problem again > https://issues.apache.org/jira/browse/MAPREDUCE-2094 > > What happens here is that the default implementation of the isSplitable > method in FileInputFormat is so unsafe that just about everyone who > im

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Matt Fellows
I could be missing something, but couldn't you just deprecate isSplitable (spelled incorrectly) and create a new isSplittable as described? On Thu, May 29, 2014 at 10:34 AM, Steve Loughran wrote: > On 28 May 2014 20:50, Niels Basjes wrote: > > > Hi, > > > > Last week I ran into this problem ag

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
My original proposal (from about 3 years ago) was to change the isSplitable method to return a safe default (you can see that in the patch that is still attached to that Jira issue). For arguments I still do not fully understand this was rejected by Todd and Doug. So that is why my new proposal i

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Matt Fellows
As someone who doesn't really contribute, just lurks, I could well be misinformed or under-informed, but I don't see why we can't deprecate a method which could cause dangerous side effects? People can still use the deprecated methods for backwards compatibility, but are discouraged by compiler war

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Jay Vyas
I think breaking backwards compat is sensible, since it's easily caught by the compiler, and in this case the alternative is a runtime error that can result in terabytes of mucked-up output. > On May 29, 2014, at 6:11 AM, Matt Fellows > wrote: > > As someone who doesn't really contribute

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
This is exactly why I'm proposing a change that will either 'fix silently' (my original patch from 3 years ago) or 'break loudly' (my current proposal) old implementations. I'm convinced that there are at least 100 companies worldwide that have a custom implementation with this bug and have no clue

Re: Change proposal for FileInputFormat isSplitable

2014-05-29 Thread Niels Basjes
I forgot to ask a relevant question: What made the original proposed solution "incompatible"? To me it still seems to be a clean backward compatible solution that fixes this issue in a simple way. Perhaps Todd can explain why? Niels On May 29, 2014 2:17 PM, "Niels Basjes" wrote: > This is exact

Re: Change proposal for FileInputFormat isSplitable

2014-05-30 Thread Doug Cutting
On Thu, May 29, 2014 at 2:47 AM, Niels Basjes wrote: > For arguments I still do not fully understand this was rejected by Todd and > Doug. Performance is a part of compatibility. Doug

Re: Change proposal for FileInputFormat isSplitable

2014-05-30 Thread Niels Basjes
Hi, The way I see the effects of the original patch on existing subclasses: - implemented isSplitable --> no performance difference. - did not implement isSplitable --> then there is no performance difference if the container is either not compressed or uses a splittable compression. -->
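The case analysis in the message above can be sketched as a toy Java model. This is not Hadoop code: `PatchEffectSketch`, `Compression`, `oldDefault`, `patchedDefault`, and `behaviourChanges` are hypothetical names, and the patched default is assumed to return false only when the file uses a non-splittable compression, which is how the thread describes the original patch.

```java
// Toy model of the cases listed above. All names are hypothetical
// stand-ins; nothing here is taken from the Hadoop code base.
final class PatchEffectSketch {

    /** How the input file is compressed, as far as splitting is concerned. */
    enum Compression { NONE, SPLITTABLE, NON_SPLITTABLE }

    /** The existing default: every file is reported as splittable. */
    static boolean oldDefault(Compression c) {
        return true;
    }

    /** Assumed patched default: refuse to split only non-splittable compression. */
    static boolean patchedDefault(Compression c) {
        return c != Compression.NON_SPLITTABLE;
    }

    /**
     * True when a subclass would observe different split behaviour after
     * the patch. A subclass with its own isSplitable never consults the default.
     */
    static boolean behaviourChanges(boolean subclassOverrides, Compression c) {
        if (subclassOverrides) {
            return false;
        }
        return oldDefault(c) != patchedDefault(c);
    }

    public static void main(String[] args) {
        for (boolean overrides : new boolean[] {true, false}) {
            for (Compression c : Compression.values()) {
                System.out.println("overrides=" + overrides + " compression=" + c
                        + " changes=" + behaviourChanges(overrides, c));
            }
        }
    }
}
```

Under these assumptions the only combination that changes is a subclass without its own isSplitable reading a file with non-splittable compression, which is the case the rest of the thread argues about.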

Re: Change proposal for FileInputFormat isSplitable

2014-05-30 Thread Tim
Remove On Fri, May 30, 2014 at 3:03 PM, Niels Basjes wrote: > Hi, > The way I see the effects of the original patch on existing subclasses: > - implemented isSplitable >--> no performance difference. > - did not implement isSplitable >--> then there is no performance difference if the co

Re: Change proposal for FileInputFormat isSplitable

2014-05-30 Thread Doug Cutting
I was trying to explain my comment, where I stated that, "changing the default implementation to return false would be an incompatible change". The patch was added 6 months after that comment, so the comment didn't address the patch. The patch does not appear to change the default implementation

Re: Change proposal for FileInputFormat isSplitable

2014-05-30 Thread Niels Basjes
Ok, got it. If someone has an Avro file (foo.avro) and gzips that (foo.avro.gz), then the framework will select the GzipCodec, which is not capable of splitting and which will cause the problem. So by gzipping a splittable file it becomes non-splittable. At my workplace we have applied gzip to av

Re: Change proposal for FileInputFormat isSplitable

2014-05-31 Thread Chris Douglas
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes wrote: > How would someone create the situation you are referring to? By adopting a naming convention where the filename suffix doesn't imply that the raw data are compressed with that codec. For example, if a user named SequenceFiles foo.lzo and fo

Re: Change proposal for FileInputFormat isSplitable

2014-05-31 Thread Niels Basjes
The Hadoop framework uses the filename extension to automatically insert the "right" decompression codec in the read pipeline. So if someone does what you describe then they would need to unload all compression codecs or face decompression errors. And if it really was gzipped then it would not be
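The suffix-driven codec selection described above can be illustrated with a small self-contained Java sketch. It is only a model: `CodecBySuffixSketch`, `Codec`, `codecFor`, and `splittableByName` are hypothetical names, and the two-entry suffix table is an assumption made for illustration, not the list of codecs a real cluster would have loaded.

```java
// Toy model of choosing a decompression codec from the file name alone.
// All names and the suffix table are hypothetical, for illustration only.
final class CodecBySuffixSketch {

    /** Assumed codecs: a suffix plus whether the format supports splitting. */
    enum Codec {
        GZIP(".gz", false),
        BZIP2(".bz2", true);

        final String suffix;
        final boolean splittable;

        Codec(String suffix, boolean splittable) {
            this.suffix = suffix;
            this.splittable = splittable;
        }
    }

    /** Returns the codec whose suffix matches the file name, or null if none does. */
    static Codec codecFor(String fileName) {
        for (Codec c : Codec.values()) {
            if (fileName.endsWith(c.suffix)) {
                return c;
            }
        }
        return null;
    }

    /** A file is treated as splittable if no codec matches or the codec can split. */
    static boolean splittableByName(String fileName) {
        Codec c = codecFor(fileName);
        return c == null || c.splittable;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"foo.avro", "foo.avro.gz", "foo.txt.bz2"}) {
            System.out.println(name + " codec=" + codecFor(name)
                    + " splittable=" + splittableByName(name));
        }
    }
}
```

Because the decision depends only on the name, a file that carries a codec suffix without actually being compressed with that codec is classified by its suffix, not by its contents. That is the renaming scenario raised in the replies.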

Re: Change proposal for FileInputFormat isSplitable

2014-06-01 Thread Chris Douglas
On Sat, May 31, 2014 at 10:53 PM, Niels Basjes wrote: > The Hadoop framework uses the filename extension to automatically insert > the "right" decompression codec in the read pipeline. This would be the new behavior, incompatible with existing code. > So if someone does what you describe then t

Re: Change proposal for FileInputFormat isSplitable

2014-06-05 Thread Doug Cutting
On Fri, May 30, 2014 at 11:05 PM, Niels Basjes wrote: > How would someone create the situation you are referring to? By renaming files. I believe I can easily write a test using public APIs that will succeed without this patch and will fail with this patch. Given the number of Hadoop-based appl

Re: Change proposal for FileInputFormat isSplitable

2014-06-06 Thread Niels Basjes
On Mon, Jun 2, 2014 at 1:21 AM, Chris Douglas wrote: > On Sat, May 31, 2014 at 10:53 PM, Niels Basjes wrote: > > The Hadoop framework uses the filename extension to automatically insert > > the "right" decompression codec in the read pipeline. > > This would be the new behavior, incompatible wi

Re: Change proposal for FileInputFormat isSplitable

2014-06-10 Thread Chris Douglas
On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes wrote: > and if you then give the file the .gz extension this breaks all common > sense / conventions about file names. That the suffix for all compression codecs in every context- and all future codecs- should determine whether a file can be split is

Re: Change proposal for FileInputFormat isSplitable

2014-06-11 Thread Niels Basjes
On Tue, Jun 10, 2014 at 8:10 PM, Chris Douglas wrote: > On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes wrote: > > and if you then give the file the .gz extension this breaks all common > > sense / conventions about file names. > > That the suffix for all compression codecs in every context- and

Re: Change proposal for FileInputFormat isSplitable

2014-06-11 Thread Chris Douglas
On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes wrote: > That's not what I meant. What I understood from what was described is that > sometimes people use an existing file extension (like .gz) for a file that > is not a gzipped file. Understood, but this change also applies to other loaded codecs,

Re: Change proposal for FileInputFormat isSplitable

2014-06-13 Thread Niels Basjes
Hi, On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas wrote: > On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes wrote: > > That's not what I meant. What I understood from what was described is > that > > sometimes people use an existing file extension (like .gz) for a file > that > > is not a gzipped

Re: Change proposal for FileInputFormat isSplitable

2014-06-13 Thread Chris Douglas
On Fri, Jun 13, 2014 at 2:54 AM, Niels Basjes wrote: > Hmmm, people only look at logs when they have a problem. So I don't think > this would be enough. This change to the framework will cause disruptions to users, to aid InputFormat authors' debugging. The latter is a much smaller population and

Re: Change proposal for FileInputFormat isSplitable

2014-06-14 Thread Niels Basjes
I did some digging through the code base and inspected all the situations I know where this goes wrong (including the Yahoo tutorial) and found a place that may be a spot to avoid the effects of this problem (instead of solving the cause of the problem). It turns out that all of those use cases use t

Re: Change proposal for FileInputFormat isSplitable

2014-07-30 Thread Niels Basjes
Hi, I talked to some people and they agreed with me that really the situation where this problem occurs is when they build a FileInputFormat derivative that also uses a LineRecordReader derivative. This is exactly the scenario that occurs if someone follows the Yahoo Hadoop tutorial. Instead of c