This is exactly why I'm proposing a change that will either 'fix silently' (my original patch from 3 years ago) or 'break loudly' (my current proposal) old implementations. I'm convinced that ther are atleast 100 companies world wide that have a custom implementation with this bug and have no clue they have been basing descision upon silently corrupted data.
On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <jayunit...@gmail.com> wrote: > I think breaking backwards compat is sensible since It's easily caught by > the compiler and in this case where the alternative is a > Runtime error that can result in terabytes of mucked up output. > > > On May 29, 2014, at 6:11 AM, Matt Fellows < > matt.fell...@bespokesoftware.com> wrote: > > > > As someone who doesn't really contribute, just lurks, I could well be > misinformed or under-informed, but I don't see why we can't deprecate a > method which could cause dangerous side effects? > > People can still use the deprecated methods for backwards compatibility, > but are discouraged by compiler warnings, and any changes they write to > their code can start to use the new functionality? > > > > *Apologies if I'm stepping into a Hadoop holy war here > > > > > >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl> wrote: > >> My original proposal (from about 3 years ago) was to change the > isSplitable > >> method to return a safe default ( you can see that in the patch that is > >> still attached to that Jira issue). > >> For arguments I still do not fully understand this was rejected by Todd > and > >> Doug. > >> > >> So that is why my new proposal is to deprecate (remove!) the old method > >> with the typo in Hadoop 3.0 and replace it with something correct and > less > >> error prone. > >> Given the fact that this would happen in a major version jump I thought > >> that would be the right time to do that. > >> > >> Niels > >> > >> > >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran < > ste...@hortonworks.com>wrote: > >> > >> > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote: > >> > > >> > > Hi, > >> > > > >> > > Last week I ran into this problem again > >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094 > >> > > > >> > > What happens here is that the default implementation of the > isSplitable > >> > > method in FileInputFormat is so unsafe that just about everyone who > >> > > implements a new subclass is likely to get this wrong. The effect of > >> > > getting this wrong is that all unit tests succeed and running it > against > >> > > 'large' input files (>>64MiB) that are compressed using a > non-splittable > >> > > compression (often Gzip) will cause the input to be fed into the > mappers > >> > > multiple time (i.e. you get garbage results without ever seeing any > >> > > errors). > >> > > > >> > > Last few days I was at Berlin buzzwords talking to someone about > this bug > >> > > > >> > > >> > that was me, I recall. > >> > > >> > > >> > > and this resulted in the following proposal which I would like your > >> > > feedback on. > >> > > > >> > > 1) This is a change that will break backwards compatibility > (deliberate > >> > > choice). > >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable with > the > >> > > typo of one 't' in the name will disappear): > >> > > (protected) isSplittableContainer --> true unless compressed > with > >> > > non-splittable compression. > >> > > (protected) isSplittableContent --> abstract, MUST be > implemented by > >> > > the subclass > >> > > (public) isSplittable --> isSplittableContainer && > >> > > isSplittableContent > >> > > > >> > > The idea is that only the isSplittable is used by other classes to > know > >> > if > >> > > this is a splittable file. > >> > > The effect I hope to get is that a developer writing their own > >> > > fileinputformat (which I alone have done twice so far) is 'forced' > and > >> > > 'helped' getting this right. > >> > > > >> > > >> > I could see making the attributes more explicit would be good -but > stopping > >> > everything that exists from working isn't going to fly. > >> > > >> > what about some subclass, AbstractSplittableFileInputFormat that > implements > >> > the container properly, requires that content one -and then calculates > >> > IsSplitable() from the results? Existing code: no change, new formats > can > >> > descend from this (and built in ones retrofitted). > >> > > >> > > >> > > >> > > The reason for me to propose this as an incompatible change is that > this > >> > > way I hope to eradicate some of the existing bugs in custom > >> > implementations > >> > > 'out there'. > >> > > > >> > > P.S. If you agree to this change then I'm willing to put my back > into it > >> > > and submit a patch. > >> > > > >> > > -- > >> > > Best regards, > >> > > > >> > > Niels Basjes > >> > > > >> > > >> > -- > >> > CONFIDENTIALITY NOTICE > >> > NOTICE: This message is intended for the use of the individual or > entity to > >> > which it is addressed and may contain information that is > confidential, > >> > privileged and exempt from disclosure under applicable law. If the > reader > >> > of this message is not the intended recipient, you are hereby > notified that > >> > any printing, copying, dissemination, distribution, disclosure or > >> > forwarding of this communication is strictly prohibited. If you have > >> > received this communication in error, please contact the sender > immediately > >> > and delete it from your system. Thank You. > >> > > >> > >> > >> > >> -- > >> Best regards / Met vriendelijke groeten, > >> > >> Niels Basjes > > > > > > > > -- > > > > First Option Software Ltd > > Signal House > > Jacklyns Lane > > Alresford > > SO24 9JJ > > Tel: +44 (0)1962 738232 > > Mob: +44 (0)7710 160458 > > Fax: +44 (0)1962 600112 > > Web: www.bespokesoftware.com > > > > ____________________________________________________ > > > > This is confidential, non-binding and not company endorsed - see full > terms at www.fosolutions.co.uk/emailpolicy.html > > First Option Software Ltd Registered No. 06340261 > > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K. > > ____________________________________________________ > > > -- Best regards / Met vriendelijke groeten, Niels Basjes