Hi,

In many Hadoop production environments you get gzipped files as the raw input, usually Apache HTTPD logfiles. Because gzip is not a splittable compression format, putting these files into Hadoop leaves you stuck with exactly 1 map task per input file. In many scenarios this is fine. However, when a lot of work is done in this very first map task, it may well be advantageous to divide the work over multiple tasks, even if there is a penalty for this scaling out.
I've created an add-on for Hadoop that makes this possible. I've reworked the patch I initially created for inclusion in Hadoop itself (see HADOOP-7076). It can now be used by simply adding a jar file to the classpath of an existing Hadoop installation; a rough usage sketch is in the P.S. below. I've put the code on GitHub ( https://github.com/nielsbasjes/splittablegzip ) and (for now) the description on my homepage: http://niels.basjes.nl/splittable-gzip

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera CDH4b1). So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823). Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

--
Best regards / Met vriendelijke groeten,

Niels Basjes
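
P.S.: To give an idea of what enabling it looks like, here is a rough sketch of a job driver using the new mapreduce API. Treat it as an illustration rather than a verbatim recipe: the codec class name is the one from the GitHub project, the 128 MiB split size is an arbitrary example value, and the authoritative configuration details are on the description page mentioned above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitGzipDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Register the codec from the jar so that .gz input files resolve
        // to it. Note: depending on your Hadoop version, setting this
        // property may replace the default codec list, so list any other
        // codecs you need here as well.
        conf.set("io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec");

        Job job = new Job(conf, "parse gzipped httpd logs");
        job.setJarByClass(SplitGzipDemo.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Cap the split size so a single .gz file becomes several map
        // tasks. 128 MiB is just an example value.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/reducer setup omitted; this only shows the codec wiring.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The intent of the io.compression.codecs setting is that the CompressionCodecFactory picks this codec for .gz files instead of the built-in GzipCodec; please check the description page for how to make sure it takes precedence in your setup.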