Hi,

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly 1 map task per input file. In
many scenarios this is fine. However, when a lot of work is done in this very
first map task, it may very well be advantageous to divide the work over
multiple tasks, even if there is a penalty for this scaling out.

I've created an addon for Hadoop that makes this possible.

I've reworked the patch I initially created for inclusion in Hadoop (see
HADOOP-7076).
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
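For illustration, here is a minimal sketch of what that could look like. The jar filename, the codec class name (nl.basjes.hadoop.io.compress.SplittableGzipCodec), and the split-size property are assumptions based on the project's documentation and may differ per Hadoop version:

```shell
# Sketch only; paths and property names are assumptions, adjust for your setup.

# 1. Put the splittable-gzip jar on the Hadoop classpath:
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:-}:/path/to/splittablegzip-1.0.jar"

# 2. Register the codec and cap the split size when submitting a job, e.g.:
#    hadoop jar myjob.jar MyJob \
#      -D io.compression.codecs=nl.basjes.hadoop.io.compress.SplittableGzipCodec \
#      -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
#      input output

echo "$HADOOP_CLASSPATH"
```

The split size controls how many map tasks each gzipped file is divided into, so it is the main knob for the scaling-out trade-off mentioned above.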

I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:
http://niels.basjes.nl/splittable-gzip

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
CDH4b1).
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
