On Wed, Jun 11, 2014 at 1:35 AM, Niels Basjes <ni...@basjes.nl> wrote:
> That's not what I meant. What I understood from what was described is that
> sometimes people use an existing file extension (like .gz) for a file that
> is not a gzipped file.

Understood, but this change also applies to other loaded codecs, like
.lzo, .bz, etc. Adding a new codec changes the default behavior for
all InputFormats that don't override this method.

> I consider "silently producing garbage" one of the worst kinds of problem
> to tackle.
> Because many custom file based input formats have stumbled (getting
> "silently produced garbage") over the current isSplitable implementation I
> really want to avoid any more of this in the future.
> That is why I want to change the implementations in this area of Hadoop in
> such a way that this "silently producing garbage" effect is taken out.

Adding validity assumptions to a common base class will affect a lot
of users, most of whom are not InputFormat authors.

> So the question remains: What is the way this should be changed?
> I'm willing to build it and submit a patch.

Would a logged warning suffice? This would aid debugging without an
incompatible change in behavior. It could also be disabled easily. -C

>> > The safest way would be either 2 or 4. Solution 3 would effectively be
>> > the same as the current implementation, yet it would catch the problem
>> > situations as long as people stick to normal file name conventions.
>> > Solution 3 would also allow removing some code duplication in several
>> > subclasses.
>> >
>> > I would go for solution 3.
>> >
>> > Niels Basjes
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
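
To make the logged-warning idea concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the class name, the `warnOnSuspectSplit` flag, and the extension set are all hypothetical, and the real `isSplitable` takes Hadoop types rather than a file name string. The point is only that the return value stays exactly what it is today, while a warning is logged for a file whose name suggests a non-splittable codec.

```java
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

public class SplitWarningSketch {
    private static final Logger LOG =
            Logger.getLogger(SplitWarningSketch.class.getName());

    // Hypothetical switch so the warning can be disabled easily.
    static boolean warnOnSuspectSplit = true;

    // Illustrative only: a real implementation would ask the loaded codecs.
    // gzip streams cannot be split at arbitrary offsets.
    private static final Set<String> NON_SPLITTABLE_EXTENSIONS = Set.of(".gz");

    // True when the file name ends in an extension from the set above.
    static boolean hasNonSplittableExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String ext : NON_SPLITTABLE_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    // Stand-in for a base-class default: behavior is unchanged (always true),
    // but a suspicious file name is reported in the log.
    static boolean isSplitable(String fileName) {
        if (warnOnSuspectSplit && hasNonSplittableExtension(fileName)) {
            LOG.warning("Treating " + fileName + " as splittable although its"
                    + " extension suggests a non-splittable codec");
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitable("part-00000.gz"));  // warns, still true
        System.out.println(isSplitable("part-00000.txt")); // no warning
    }
}
```

Because the decision itself is untouched, existing jobs keep their current splits; the only observable difference is the extra log line, which goes away when the flag is set to false.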