Hi,

On Wed, Jun 11, 2014 at 8:25 PM, Chris Douglas <cdoug...@apache.org> wrote:

> On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes <ni...@basjes.nl> wrote:
> > That's not what I meant. What I understood from what was described is
> > that sometimes people use an existing file extension (like .gz) for a
> > file that is not a gzipped file.
>


> Understood, but this change also applies to other loaded codecs, like
> .lzo, .bz, etc. Adding a new codec changes the default behavior for
> all InputFormats that don't override this method.
>

Yes it would. I think that forcing the developer of the file based
inputformat to implement this would be the best way to go.
Making this method abstract is the first thing that springs to mind.

This would break backwards compatibility, so I think we can only do that
with the 3.0.0 version.


> > I consider "silently producing garbage" one of the worst kinds of problem
> > to tackle.
> > Because many custom file based input formats have stumbled (getting
> > "silently produced garbage") over the current isSplitable implementation
> > I really want to avoid any more of this in the future.
> > That is why I want to change the implementations in this area of Hadoop
> > in such a way that this "silently producing garbage" effect is taken out.
>
> Adding validity assumptions to a common base class will affect a lot
> of users, most of whom are not InputFormat authors.
>

True, but if a user uses an InputFormat written by someone else and it
then "silently produces garbage", they are also affected, and in a much
worse way.


> > So the question remains: What is the way this should be changed?
> > I'm willing to build it and submit a patch.
>
> Would a logged warning suffice? This would aid debugging without an
> incompatible change in behavior. It could also be disabled easily. -C


Hmmm, people only look at logs when they have a problem. So I don't think
this would be enough.

Perhaps this makes sense:
- For 3.0: Shout at the developer who does it wrong, i.e. force them to
think about this: create a new abstract method isSplittable (two t's) in
FileInputFormat and remove isSplitable (one t).

To avoid needless code duplication (which we already have in the codebase),
create a helper method, something like 'fileNameIndicatesSplittableFile'
(returns an enum: Splittable/NonSplittable/Unknown).

- For 2.x: Keep the end user safe: avoid "silently producing garbage" in all
situations where the developer already did it wrong (i.e. change
isSplitable ==> return false). This costs performance only in those
situations where the developer actually did it wrong (i.e. they didn't
think this through).
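
To make the two options concrete, here is a rough, self-contained sketch.
It deliberately does not use the real Hadoop classes: the class names
(SplitHints, Sketch3FileInputFormat, Sketch2FileInputFormat,
PlainTextFormat), the String file name argument and the hard-coded
extension list are all assumptions for illustration only. A real helper
would presumably ask the loaded codecs instead of looking at a fixed list.

```java
import java.util.Locale;

/** Three-way answer, as proposed above: Splittable / NonSplittable / Unknown. */
enum Splittability { SPLITTABLE, NON_SPLITTABLE, UNKNOWN }

/** Hypothetical shared helper, so subclasses do not duplicate the check. */
final class SplitHints {
    private SplitHints() {}

    static Splittability fileNameIndicatesSplittableFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Splittability.UNKNOWN; // no extension to go on
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Assumption: illustrative extensions only, not the real codec lookup.
        switch (ext) {
            case "gz":
                return Splittability.NON_SPLITTABLE;
            case "bz2":
                return Splittability.SPLITTABLE;
            default:
                return Splittability.UNKNOWN;
        }
    }
}

/** 3.0 idea: the method is abstract, so every format author must decide. */
abstract class Sketch3FileInputFormat {
    protected abstract boolean isSplittable(String fileName);
}

/** Example subclass: a plain text format that treats Unknown as splittable. */
class PlainTextFormat extends Sketch3FileInputFormat {
    @Override
    protected boolean isSplittable(String fileName) {
        return SplitHints.fileNameIndicatesSplittableFile(fileName)
                != Splittability.NON_SPLITTABLE;
    }
}

/** 2.x idea: the inherited default is the safe answer, never split. */
class Sketch2FileInputFormat {
    protected boolean isSplitable(String fileName) {
        return false;
    }
}
```

In this sketch a format that forgets to think about splitting either does
not compile (3.0 variant) or runs slower but correctly (2.x variant).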

How about that?

P.S. I created an issue for the NLineInputFormat problem I found:
https://issues.apache.org/jira/browse/MAPREDUCE-5925

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
