I forgot to ask a relevant question: What made the original proposed solution "incompatible"? To me it still seems to be a clean backward compatible solution that fixes this issue in a simple way.
Perhaps Todd can explain why? Niels On May 29, 2014 2:17 PM, "Niels Basjes" <ni...@basjes.nl> wrote: > This is exactly why I'm proposing a change that will either 'fix silently' > (my original patch from 3 years ago) or 'break loudly' (my current > proposal) old implementations. > I'm convinced that ther are atleast 100 companies world wide that have a > custom implementation with this bug and have no clue they have been basing > descision upon silently corrupted data. > > > On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <jayunit...@gmail.com> wrote: > >> I think breaking backwards compat is sensible since It's easily caught by >> the compiler and in this case where the alternative is a >> Runtime error that can result in terabytes of mucked up output. >> >> > On May 29, 2014, at 6:11 AM, Matt Fellows < >> matt.fell...@bespokesoftware.com> wrote: >> > >> > As someone who doesn't really contribute, just lurks, I could well be >> misinformed or under-informed, but I don't see why we can't deprecate a >> method which could cause dangerous side effects? >> > People can still use the deprecated methods for backwards >> compatibility, but are discouraged by compiler warnings, and any changes >> they write to their code can start to use the new functionality? >> > >> > *Apologies if I'm stepping into a Hadoop holy war here >> > >> > >> >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl> >> wrote: >> >> My original proposal (from about 3 years ago) was to change the >> isSplitable >> >> method to return a safe default ( you can see that in the patch that is >> >> still attached to that Jira issue). >> >> For arguments I still do not fully understand this was rejected by >> Todd and >> >> Doug. >> >> >> >> So that is why my new proposal is to deprecate (remove!) the old method >> >> with the typo in Hadoop 3.0 and replace it with something correct and >> less >> >> error prone. >> >> Given the fact that this would happen in a major version jump I thought >> >> that would be the right time to do that. >> >> >> >> Niels >> >> >> >> >> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran < >> ste...@hortonworks.com>wrote: >> >> >> >> > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote: >> >> > >> >> > > Hi, >> >> > > >> >> > > Last week I ran into this problem again >> >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094 >> >> > > >> >> > > What happens here is that the default implementation of the >> isSplitable >> >> > > method in FileInputFormat is so unsafe that just about everyone who >> >> > > implements a new subclass is likely to get this wrong. The effect >> of >> >> > > getting this wrong is that all unit tests succeed and running it >> against >> >> > > 'large' input files (>>64MiB) that are compressed using a >> non-splittable >> >> > > compression (often Gzip) will cause the input to be fed into the >> mappers >> >> > > multiple time (i.e. you get garbage results without ever seeing any >> >> > > errors). >> >> > > >> >> > > Last few days I was at Berlin buzzwords talking to someone about >> this bug >> >> > > >> >> > >> >> > that was me, I recall. >> >> > >> >> > >> >> > > and this resulted in the following proposal which I would like your >> >> > > feedback on. >> >> > > >> >> > > 1) This is a change that will break backwards compatibility >> (deliberate >> >> > > choice). >> >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable >> with the >> >> > > typo of one 't' in the name will disappear): >> >> > > (protected) isSplittableContainer --> true unless compressed >> with >> >> > > non-splittable compression. >> >> > > (protected) isSplittableContent --> abstract, MUST be >> implemented by >> >> > > the subclass >> >> > > (public) isSplittable --> isSplittableContainer && >> >> > > isSplittableContent >> >> > > >> >> > > The idea is that only the isSplittable is used by other classes to >> know >> >> > if >> >> > > this is a splittable file. >> >> > > The effect I hope to get is that a developer writing their own >> >> > > fileinputformat (which I alone have done twice so far) is 'forced' >> and >> >> > > 'helped' getting this right. >> >> > > >> >> > >> >> > I could see making the attributes more explicit would be good -but >> stopping >> >> > everything that exists from working isn't going to fly. >> >> > >> >> > what about some subclass, AbstractSplittableFileInputFormat that >> implements >> >> > the container properly, requires that content one -and then >> calculates >> >> > IsSplitable() from the results? Existing code: no change, new >> formats can >> >> > descend from this (and built in ones retrofitted). >> >> > >> >> > >> >> > >> >> > > The reason for me to propose this as an incompatible change is >> that this >> >> > > way I hope to eradicate some of the existing bugs in custom >> >> > implementations >> >> > > 'out there'. >> >> > > >> >> > > P.S. If you agree to this change then I'm willing to put my back >> into it >> >> > > and submit a patch. >> >> > > >> >> > > -- >> >> > > Best regards, >> >> > > >> >> > > Niels Basjes >> >> > > >> >> > >> >> > -- >> >> > CONFIDENTIALITY NOTICE >> >> > NOTICE: This message is intended for the use of the individual or >> entity to >> >> > which it is addressed and may contain information that is >> confidential, >> >> > privileged and exempt from disclosure under applicable law. If the >> reader >> >> > of this message is not the intended recipient, you are hereby >> notified that >> >> > any printing, copying, dissemination, distribution, disclosure or >> >> > forwarding of this communication is strictly prohibited. If you have >> >> > received this communication in error, please contact the sender >> immediately >> >> > and delete it from your system. Thank You. >> >> > >> >> >> >> >> >> >> >> -- >> >> Best regards / Met vriendelijke groeten, >> >> >> >> Niels Basjes >> > >> > >> > >> > -- >> > >> > First Option Software Ltd >> > Signal House >> > Jacklyns Lane >> > Alresford >> > SO24 9JJ >> > Tel: +44 (0)1962 738232 >> > Mob: +44 (0)7710 160458 >> > Fax: +44 (0)1962 600112 >> > Web: www.bespokesoftware.com >> > >> > ____________________________________________________ >> > >> > This is confidential, non-binding and not company endorsed - see full >> terms at www.fosolutions.co.uk/emailpolicy.html >> > First Option Software Ltd Registered No. 06340261 >> > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K. >> > ____________________________________________________ >> > >> > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes >