I could be missing something, but couldn't you just deprecate isSplitable (spelled incorrectly) and create a new isSplittable as described?
On Thu, May 29, 2014 at 10:34 AM, Steve Loughran <ste...@hortonworks.com> wrote: > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote: > > > Hi, > > > > Last week I ran into this problem again > > https://issues.apache.org/jira/browse/MAPREDUCE-2094 > > > > What happens here is that the default implementation of the isSplitable > > method in FileInputFormat is so unsafe that just about everyone who > > implements a new subclass is likely to get this wrong. The effect of > > getting this wrong is that all unit tests succeed and running it against > > 'large' input files (>>64MiB) that are compressed using a non-splittable > > compression (often Gzip) will cause the input to be fed into the mappers > > multiple time (i.e. you get garbage results without ever seeing any > > errors). > > > > Last few days I was at Berlin buzzwords talking to someone about this bug > > > > that was me, I recall. > > > > and this resulted in the following proposal which I would like your > > feedback on. > > > > 1) This is a change that will break backwards compatibility (deliberate > > choice). > > 2) The FileInputFormat will get 3 methods (the old isSplitable with the > > typo of one 't' in the name will disappear): > > (protected) isSplittableContainer --> true unless compressed with > > non-splittable compression. > > (protected) isSplittableContent --> abstract, MUST be implemented by > > the subclass > > (public) isSplittable --> isSplittableContainer && > > isSplittableContent > > > > The idea is that only the isSplittable is used by other classes to know > if > > this is a splittable file. > > The effect I hope to get is that a developer writing their own > > fileinputformat (which I alone have done twice so far) is 'forced' and > > 'helped' getting this right. > > > > I could see making the attributes more explicit would be good -but stopping > everything that exists from working isn't going to fly. > > what about some subclass, AbstractSplittableFileInputFormat that implements > the container properly, requires that content one -and then calculates > IsSplitable() from the results? Existing code: no change, new formats can > descend from this (and built in ones retrofitted). > > > > > The reason for me to propose this as an incompatible change is that this > > way I hope to eradicate some of the existing bugs in custom > implementations > > 'out there'. > > > > P.S. If you agree to this change then I'm willing to put my back into it > > and submit a patch. > > > > -- > > Best regards, > > > > Niels Basjes > > > > -- > CONFIDENTIALITY NOTICE > NOTICE: This message is intended for the use of the individual or entity to > which it is addressed and may contain information that is confidential, > privileged and exempt from disclosure under applicable law. If the reader > of this message is not the intended recipient, you are hereby notified that > any printing, copying, dissemination, distribution, disclosure or > forwarding of this communication is strictly prohibited. If you have > received this communication in error, please contact the sender immediately > and delete it from your system. Thank You. > -- [image: cid:1CBF4038-3F0F-4FC2-A1FF-6DC81B8B6F94] First Option Software Ltd Signal House Jacklyns Lane Alresford SO24 9JJ Tel: +44 (0)1962 738232 Mob: +44 (0)7710 160458 Fax: +44 (0)1962 600112 Web: www.b <http://www.fosolutions.co.uk/>espokesoftware.com <http://bespokesoftware.com/> -- ____________________________________________________ This is confidential, non-binding and not company endorsed - see full terms at www.fosolutions.co.uk/emailpolicy.html First Option Software Ltd Registered No. 06340261 Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K. ____________________________________________________