This is exactly why I'm proposing a change that will either 'fix silently'
(my original patch from 3 years ago) or 'break loudly' (my current
proposal) old implementations.
I'm convinced that ther are atleast 100 companies world wide that have a
custom implementation with this bug and have no clue they have been basing
descision upon silently corrupted data.


On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <jayunit...@gmail.com> wrote:

> I think breaking backwards compat is sensible since It's easily caught by
> the compiler and  in this case where the alternative is a
> Runtime error that can result in terabytes of mucked up output.
>
> > On May 29, 2014, at 6:11 AM, Matt Fellows <
> matt.fell...@bespokesoftware.com> wrote:
> >
> > As someone who doesn't really contribute, just lurks, I could well be
> misinformed or under-informed, but I don't see why we can't deprecate a
> method which could cause dangerous side effects?
> > People can still use the deprecated methods for backwards compatibility,
> but are discouraged by compiler warnings, and any changes they write to
> their code can start to use the new functionality?
> >
> > *Apologies if I'm stepping into a Hadoop holy war here
> >
> >
> >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl> wrote:
> >> My original proposal (from about 3 years ago) was to change the
> isSplitable
> >> method to return a safe default ( you can see that in the patch that is
> >> still attached to that Jira issue).
> >> For arguments I still do not fully understand this was rejected by Todd
> and
> >> Doug.
> >>
> >> So that is why my new proposal is to deprecate (remove!) the old method
> >> with the typo in Hadoop 3.0 and replace it with something correct and
> less
> >> error prone.
> >> Given the fact that this would happen in a major version jump I thought
> >> that would be the right time to do that.
> >>
> >> Niels
> >>
> >>
> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <
> ste...@hortonworks.com>wrote:
> >>
> >> > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > Last week I ran into this problem again
> >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094
> >> > >
> >> > > What happens here is that the default implementation of the
> isSplitable
> >> > > method in FileInputFormat is so unsafe that just about everyone who
> >> > > implements a new subclass is likely to get this wrong. The effect of
> >> > > getting this wrong is that all unit tests succeed and running it
> against
> >> > > 'large' input files (>>64MiB) that are compressed using a
> non-splittable
> >> > > compression (often Gzip) will cause the input to be fed into the
> mappers
> >> > > multiple time (i.e. you get garbage results without ever seeing any
> >> > > errors).
> >> > >
> >> > > Last few days I was at Berlin buzzwords talking to someone about
> this bug
> >> > >
> >> >
> >> > that was me, I recall.
> >> >
> >> >
> >> > > and this resulted in the following proposal which I would like your
> >> > > feedback on.
> >> > >
> >> > > 1) This is a change that will break backwards compatibility
> (deliberate
> >> > > choice).
> >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable with
> the
> >> > > typo of one 't' in the name will disappear):
> >> > >     (protected) isSplittableContainer --> true unless compressed
> with
> >> > > non-splittable compression.
> >> > >     (protected) isSplittableContent --> abstract, MUST be
> implemented by
> >> > > the subclass
> >> > >     (public)      isSplittable --> isSplittableContainer &&
> >> > > isSplittableContent
> >> > >
> >> > > The idea is that only the isSplittable is used by other classes to
> know
> >> > if
> >> > > this is a splittable file.
> >> > > The effect I hope to get is that a developer writing their own
> >> > > fileinputformat (which I alone have done twice so far) is 'forced'
> and
> >> > > 'helped' getting this right.
> >> > >
> >> >
> >> > I could see making the attributes more explicit would be good -but
> stopping
> >> > everything that exists from working isn't going to fly.
> >> >
> >> > what about some subclass, AbstractSplittableFileInputFormat that
> implements
> >> > the container properly, requires that content one -and then calculates
> >> > IsSplitable() from the results? Existing code: no change, new formats
> can
> >> > descend from this (and built in ones retrofitted).
> >> >
> >> >
> >> >
> >> > > The reason for me to propose this as an incompatible change is that
> this
> >> > > way I hope to eradicate some of the existing bugs in custom
> >> > implementations
> >> > > 'out there'.
> >> > >
> >> > > P.S. If you agree to this change then I'm willing to put my back
> into it
> >> > > and submit a patch.
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > > Niels Basjes
> >> > >
> >> >
> >> > --
> >> > CONFIDENTIALITY NOTICE
> >> > NOTICE: This message is intended for the use of the individual or
> entity to
> >> > which it is addressed and may contain information that is
> confidential,
> >> > privileged and exempt from disclosure under applicable law. If the
> reader
> >> > of this message is not the intended recipient, you are hereby
> notified that
> >> > any printing, copying, dissemination, distribution, disclosure or
> >> > forwarding of this communication is strictly prohibited. If you have
> >> > received this communication in error, please contact the sender
> immediately
> >> > and delete it from your system. Thank You.
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards / Met vriendelijke groeten,
> >>
> >> Niels Basjes
> >
> >
> >
> > --
> >
> > First Option Software Ltd
> > Signal House
> > Jacklyns Lane
> > Alresford
> > SO24 9JJ
> > Tel: +44 (0)1962 738232
> > Mob: +44 (0)7710 160458
> > Fax: +44 (0)1962 600112
> > Web: www.bespokesoftware.com
> >
> > ____________________________________________________
> >
> > This is confidential, non-binding and not company endorsed - see full
> terms at www.fosolutions.co.uk/emailpolicy.html
> > First Option Software Ltd Registered No. 06340261
> > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
> > ____________________________________________________
> >
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Reply via email to