I could be missing something, but couldn't you just deprecate isSplitable
(spelled incorrectly) and create a new isSplittable as described?


On Thu, May 29, 2014 at 10:34 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

> On 28 May 2014 20:50, Niels Basjes <ni...@basjes.nl> wrote:
>
> > Hi,
> >
> > Last week I ran into this problem again
> > https://issues.apache.org/jira/browse/MAPREDUCE-2094
> >
> > What happens here is that the default implementation of the isSplitable
> > method in FileInputFormat is so unsafe that just about everyone who
> > implements a new subclass is likely to get this wrong. The effect of
> > getting this wrong is that all unit tests succeed and running it against
> > 'large' input files (>>64MiB) that are compressed using a non-splittable
> > compression (often Gzip) will cause the input to be fed into the mappers
> > multiple time (i.e. you get garbage results without ever seeing any
> > errors).
> >
> > Last few days I was at Berlin buzzwords talking to someone about this bug
> >
>
> that was me, I recall.
>
>
> > and this resulted in the following proposal which I would like your
> > feedback on.
> >
> > 1) This is a change that will break backwards compatibility (deliberate
> > choice).
> > 2) The FileInputFormat will get 3 methods (the old isSplitable with the
> > typo of one 't' in the name will disappear):
> >     (protected) isSplittableContainer --> true unless compressed with
> > non-splittable compression.
> >     (protected) isSplittableContent --> abstract, MUST be implemented by
> > the subclass
> >     (public)      isSplittable --> isSplittableContainer &&
> > isSplittableContent
> >
> > The idea is that only the isSplittable is used by other classes to know
> if
> > this is a splittable file.
> > The effect I hope to get is that a developer writing their own
> > fileinputformat (which I alone have done twice so far) is 'forced' and
> > 'helped' getting this right.
> >
>
> I could see making the attributes more explicit would be good -but stopping
> everything that exists from working isn't going to fly.
>
> what about some subclass, AbstractSplittableFileInputFormat that implements
> the container properly, requires that content one -and then calculates
> IsSplitable() from the results? Existing code: no change, new formats can
> descend from this (and built in ones retrofitted).
>
>
>
> > The reason for me to propose this as an incompatible change is that this
> > way I hope to eradicate some of the existing bugs in custom
> implementations
> > 'out there'.
> >
> > P.S. If you agree to this change then I'm willing to put my back into it
> > and submit a patch.
> >
> > --
> > Best regards,
> >
> > Niels Basjes
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>



-- 
[image: cid:1CBF4038-3F0F-4FC2-A1FF-6DC81B8B6F94]
 First Option Software Ltd
Signal House
Jacklyns Lane
Alresford
SO24 9JJ
Tel: +44 (0)1962 738232
Mob: +44 (0)7710 160458
Fax: +44 (0)1962 600112
Web: www.b <http://www.fosolutions.co.uk/>espokesoftware.com
<http://bespokesoftware.com/>

-- 
____________________________________________________

This is confidential, non-binding and not company endorsed - see full terms 
at www.fosolutions.co.uk/emailpolicy.html 
First Option Software Ltd Registered No. 06340261
Signal House, Jacklyns Lane, Alresford, Hampshire, SO24 9JJ, U.K.
____________________________________________________

Reply via email to