I forgot to ask a relevant question: What made the original proposed
solution "incompatible"?
To me it still seems to be a clean backward compatible solution that fixes
this issue in a simple way.

Perhaps Todd can explain why?

Niels
On May 29, 2014 2:17 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

> This is exactly why I'm proposing a change that will either 'fix silently'
> (my original patch from 3 years ago) or 'break loudly' (my current
> proposal) old implementations.
> I'm convinced that ther are atleast 100 companies world wide that have a
> custom implementation with this bug and have no clue they have been basing
> descision upon silently corrupted data.
>
>
> On Thu, May 29, 2014 at 1:21 PM, Jay Vyas <jayunit...@gmail.com> wrote:
>
>> I think breaking backwards compat is sensible since It's easily caught by
>> the compiler and  in this case where the alternative is a
>> Runtime error that can result in terabytes of mucked up output.
>>
>> > On May 29, 2014, at 6:11 AM, Matt Fellows <
>> matt.fell...@bespokesoftware.com> wrote:
>> >
>> > As someone who doesn't really contribute, just lurks, I could well be
>> misinformed or under-informed, but I don't see why we can't deprecate a
>> method which could cause dangerous side effects?
>> > People can still use the deprecated methods for backwards
>> compatibility, but are discouraged by compiler warnings, and any changes
>> they write to their code can start to use the new functionality?
>> >
>> > *Apologies if I'm stepping into a Hadoop holy war here
>> >
>> >
>> >> On Thu, May 29, 2014 at 10:47 AM, Niels Basjes <ni...@basjes.nl>
>> wrote:
>> >> My original proposal (from about 3 years ago) was to change the
>> isSplitable
>> >> method to return a safe default ( you can see that in the patch that is
>> >> still attached to that Jira issue).
>> >> For arguments I still do not fully understand this was rejected by
>> Todd and
>> >> Doug.
>> >>
>> >> So that is why my new proposal is to deprecate (remove!) the old method
>> >> with the typo in Hadoop 3.0 and replace it with something correct and
>> less
>> >> error prone.
>> >> Given the fact that this would happen in a major version jump I thought
>> >> that would be the right time to do that.
>> >>
>> >> Niels
>> >>
>> >>
>> >> On Thu, May 29, 2014 at 11:34 AM, Steve Loughran <
>> ste...@hortonworks.com>wrote:
>> >>
>> >> > On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote:
>> >> >
>> >> > > Hi,
>> >> > >
>> >> > > Last week I ran into this problem again
>> >> > > https://issues.apache.org/jira/browse/MAPREDUCE-2094
>> >> > >
>> >> > > What happens here is that the default implementation of the
>> isSplitable
>> >> > > method in FileInputFormat is so unsafe that just about everyone who
>> >> > > implements a new subclass is likely to get this wrong. The effect
>> of
>> >> > > getting this wrong is that all unit tests succeed and running it
>> against
>> >> > > 'large' input files (>>64MiB) that are compressed using a
>> non-splittable
>> >> > > compression (often Gzip) will cause the input to be fed into the
>> mappers
>> >> > > multiple time (i.e. you get garbage results without ever seeing any
>> >> > > errors).
>> >> > >
>> >> > > Last few days I was at Berlin buzzwords talking to someone about
>> this bug
>> >> > >
>> >> >
>> >> > that was me, I recall.
>> >> >
>> >> >
>> >> > > and this resulted in the following proposal which I would like your
>> >> > > feedback on.
>> >> > >
>> >> > > 1) This is a change that will break backwards compatibility
>> (deliberate
>> >> > > choice).
>> >> > > 2) The FileInputFormat will get 3 methods (the old isSplitable
>> with the
>> >> > > typo of one 't' in the name will disappear):
>> >> > >     (protected) isSplittableContainer --> true unless compressed
>> with
>> >> > > non-splittable compression.
>> >> > >     (protected) isSplittableContent --> abstract, MUST be
>> implemented by
>> >> > > the subclass
>> >> > >     (public)      isSplittable --> isSplittableContainer &&
>> >> > > isSplittableContent
>> >> > >
>> >> > > The idea is that only the isSplittable is used by other classes to
>> know
>> >> > if
>> >> > > this is a splittable file.
>> >> > > The effect I hope to get is that a developer writing their own
>> >> > > fileinputformat (which I alone have done twice so far) is 'forced'
>> and
>> >> > > 'helped' getting this right.
>> >> > >
>> >> >
>> >> > I could see making the attributes more explicit would be good -but
>> stopping
>> >> > everything that exists from working isn't going to fly.
>> >> >
>> >> > what about some subclass, AbstractSplittableFileInputFormat that
>> implements
>> >> > the container properly, requires that content one -and then
>> calculates
>> >> > IsSplitable() from the results? Existing code: no change, new
>> formats can
>> >> > descend from this (and built in ones retrofitted).
>> >> >
>> >> >
>> >> >
>> >> > > The reason for me to propose this as an incompatible change is
>> that this
>> >> > > way I hope to eradicate some of the existing bugs in custom
>> >> > implementations
>> >> > > 'out there'.
>> >> > >
>> >> > > P.S. If you agree to this change then I'm willing to put my back
>> into it
>> >> > > and submit a patch.
>> >> > >
>> >> > > --
>> >> > > Best regards,
>> >> > >
>> >> > > Niels Basjes
>> >> > >
>> >> >
>> >> > --
>> >> > CONFIDENTIALITY NOTICE
>> >> > NOTICE: This message is intended for the use of the individual or
>> entity to
>> >> > which it is addressed and may contain information that is
>> confidential,
>> >> > privileged and exempt from disclosure under applicable law. If the
>> reader
>> >> > of this message is not the intended recipient, you are hereby
>> notified that
>> >> > any printing, copying, dissemination, distribution, disclosure or
>> >> > forwarding of this communication is strictly prohibited. If you have
>> >> > received this communication in error, please contact the sender
>> immediately
>> >> > and delete it from your system. Thank You.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Best regards / Met vriendelijke groeten,
>> >>
>> >> Niels Basjes
>> >
>> >
>> >
>> > --
>> >
>> > First Option Software Ltd
>> > Signal House
>> > Jacklyns Lane
>> > Alresford
>> > SO24 9JJ
>> > Tel: +44 (0)1962 738232
>> > Mob: +44 (0)7710 160458
>> > Fax: +44 (0)1962 600112
>> > Web: www.bespokesoftware.com
>> >
>> > ____________________________________________________
>> >
>> > This is confidential, non-binding and not company endorsed - see full
>> terms at www.fosolutions.co.uk/emailpolicy.html
>> > First Option Software Ltd Registered No. 06340261
>> > Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
>> > ____________________________________________________
>> >
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Reply via email to