On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes <ni...@basjes.nl> wrote:
> That's not what I meant. What I understood from what was described is that
> sometimes people use an existing file extension (like .gz) for a file that
> is not a gzipped file.

Understood, but this change also applies to other loaded codecs, like
.lzo, .bz, etc. Adding a new codec changes the default behavior for
all InputFormats that don't override isSplitable.

> I consider "silently producing garbage" one of the worst kinds of problem
> to tackle.
> Because many custom file-based input formats have stumbled (getting
> "silently produced garbage") over the current isSplitable implementation, I
> really want to avoid any more of this in the future.
> That is why I want to change the implementations in this area of Hadoop in
> such a way that this "silently producing garbage" effect is taken out.

Adding validity assumptions to a common base class will affect a lot
of users, most of whom are not InputFormat authors.

> So the question remains: What is the way this should be changed?
> I'm willing to build it and submit a patch.

Would a logged warning suffice? This would aid debugging without an
incompatible change in behavior. It could also be disabled easily. -C
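
A rough sketch of the warning idea, written as plain Java without any
Hadoop dependency so that it stands on its own. Every name here
(SplitWarningSketch, NON_SPLITTABLE_EXTENSIONS, warnOnSuspectSplit) is
hypothetical and only models the approach; it is not Hadoop's API or an
actual patch. The answer returned stays the same as before (always
splittable), and the only addition is a log line that can be switched off.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

/** Plain-Java model of the idea; it does not use or mirror Hadoop's classes. */
class SplitWarningSketch {
    private static final Logger LOG =
        Logger.getLogger(SplitWarningSketch.class.getName());

    // Hypothetical stand-in for the extensions of loaded, non-splittable codecs.
    static final Set<String> NON_SPLITTABLE_EXTENSIONS =
        new HashSet<>(Arrays.asList(".gz"));

    // Hypothetical switch so the warning can be disabled easily.
    static boolean warnOnSuspectSplit = true;

    /** Returns true if the file name ends with a listed extension. */
    static boolean looksNonSplittable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : NON_SPLITTABLE_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    /** Keeps the existing answer (always splittable) and only adds a warning. */
    static boolean isSplitable(String fileName) {
        if (warnOnSuspectSplit && looksNonSplittable(fileName)) {
            LOG.warning("Splitting " + fileName
                + " although its extension suggests a non-splittable codec");
        }
        return true; // behavior is unchanged; only the log line is new
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.gz"));  // logs a warning
        System.out.println(isSplitable("part-00000.txt")); // no warning
    }
}
```

Because the return value never changes, existing jobs would behave exactly
as they do today; the log line only gives a starting point for debugging.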

>> > The safest way would be either 2 or 4. Solution 3 would effectively be
>> > the same as the current implementation, yet it would catch the problem
>> > situations as long as people stick to normal file name conventions.
>> > Solution 3 would also allow removing some code duplication in several
>> > subclasses.
>> >
>> > I would go for solution 3.
>> >
>> > Niels Basjes
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
