Hi,

Over the last few weeks we built an application on top of Hadoop. Because we are working with special log files (line oriented, textual, and gzipped) and wanted to extract specific fields from them before they reach our mapper, we chose to implement our own derivative of the FileInputFormat class.
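For concreteness, the relevant part of our input format now looks roughly like this. The class name and the stand-in LineRecordReader are illustrative (our real record reader does the field extraction), and the isSplitable() override shown is the fix we ended up adding after the incident described below:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class GzippedLogInputFormat extends FileInputFormat<LongWritable, Text> {

  // Without this override the inherited isSplitable() returns true, so the
  // framework creates several splits for one gzipped file, and every split
  // then decompresses and reads the whole file from the beginning.
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Only split files for which no compression codec is registered.
    return new CompressionCodecFactory(context.getConfiguration())
        .getCodec(file) == null;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Our real reader extracts the specific fields from each log line;
    // a plain LineRecordReader stands in for it here.
    return new LineRecordReader();
  }
}

(This uses the org.apache.hadoop.mapreduce API; the older mapred API has the same default behaviour in its isSplitable(FileSystem, Path).)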
All went well until we tried it with "big" files (100 MiB and larger). We noticed that many of the values came out doubled, and when we put full-size production data through a test run, single events were counted as 36. It took a while to figure out what went wrong, but the root cause was that the isSplitable() method returned true for gzipped files, which are not splittable; the implementation in effect was the one inherited from FileInputFormat.

The documentation for that method states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." This gave us the illusion that the default FileInputFormat implementation would handle compression correctly, and because everything worked with small gzipped files, we never expected the actual implementation to be simply "return true;". Because of that unconditional true, the framework created several splits per gzipped file, and since a gzip stream can only be decompressed from the start, every split ended up reading the entire file, so each record was processed once per split. Hence the really messy effects in our case. The derived TextInputFormat class, by contrast, does have a compression-aware implementation of isSplitable().

Given my current knowledge of Hadoop, I would have made the default isSplitable() implementation in FileInputFormat either "safe" (always return false) or "correct" (return whatever is right for the applicable compression codec). The latter would match what I expected from the documentation. I would like to understand the reasoning behind the current implementation choice, mainly in relation to what the documentation led me to expect.

Thanks for explaining.

--
Best regards,
Niels Basjes