On Fri, Jun 6, 2014 at 4:03 PM, Niels Basjes <ni...@basjes.nl> wrote:
> and if you then give the file the .gz extension this breaks all common
> sense / conventions about file names.

That the suffix for all compression codecs in every context, and for all
future codecs, should determine whether a file can be split is not an
assumption we can make safely. Again, that's not an assumption that held
when people built their current systems, and they would be justly annoyed
with the project for changing it.

> I hold "correct data" much higher than performance and scalability; so the
> performance impact is a concern but it is much less important than the list
> of bugs we are facing right now.

These are not bugs. NLineInputFormat doesn't support compressed input, and
why would it?

-C

> The safest way would be either 2 or 4. Solution 3 would effectively be the
> same as the current implementation, yet it would catch the problem
> situations as long as people stick to normal file name conventions.
> Solution 3 would also allow removing some code duplication in several
> subclasses.
>
> I would go for solution 3.
>
> Niels Basjes