Hi,

Some time ago I had an idea and implemented it.

Normally a gzipped input file can only be processed by a single mapper, and thus only on a single CPU core, because gzip is not a splittable format. What I created makes it possible to process a gzipped file in such a way that it can run on several mappers in parallel. I've put the javadoc I created on my homepage so you can read more about the details:

http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
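To give a rough impression, enabling such a codec through job configuration would look something like the sketch below. The codec class name and the split-size tuning value are illustrative assumptions here (see the javadoc for the specifics), and the property names assume the Hadoop 2.x MapReduce API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplittableGzipExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Register the codec so it is used instead of the built-in
            // GzipCodec. The class name is illustrative only.
            conf.set("io.compression.codecs",
                     "nl.basjes.hadoop.io.compress.SplittableGzipCodec");

            // Each mapper has to decompress (and discard) everything before
            // its own split, so fewer and larger splits waste less work.
            // The value is a hypothetical tuning choice, not a recommendation.
            conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                         256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "parallel-gzip-example");
            job.setJarByClass(SplittableGzipExample.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper, reducer, and key/value classes as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }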
Now, the question raised by one of the people reviewing this code was: should this implementation be part of the core Hadoop feature set? The main reason given against it is that using it requires some understanding of what is actually happening, and as such it cannot simply be enabled by default.

I would like to hear what the Hadoop Core/MapReduce users think. Should this be:
- part of the default Hadoop feature set, so that anyone can enable it simply by setting the right configuration?
- a separate library?
- a nice idea I had fun building, but that no one needs?
- ... ?

--
Best regards / Met vriendelijke groeten,

Niels Basjes