Hi,

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly 1 map task per input file. In
many scenarios this is fine. However, when a lot of work is done in this very
first map task, it may very well be advantageous to divide the work over
multiple tasks, even if there is a penalty for this scaling out.

I've created an addon for Hadoop that makes this possible.

I've reworked the patch I initially created for inclusion in Hadoop (see
HADOOP-7076).
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
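For illustration, here is a minimal sketch of what that could look like. The jar filename, the codec class name (nl.basjes.hadoop.io.compress.SplittableGzipCodec), and the split-size property are assumptions based on the project's documentation and may differ per Hadoop version:

```shell
# Sketch only; paths and property names are assumptions, adjust for your setup.

# 1. Put the splittable-gzip jar on the Hadoop classpath:
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:-}:/path/to/splittablegzip-1.0.jar"

# 2. Register the codec and cap the split size when submitting a job, e.g.:
#    hadoop jar myjob.jar MyJob \
#      -D io.compression.codecs=nl.basjes.hadoop.io.compress.SplittableGzipCodec \
#      -D mapreduce.input.fileinputformat.split.maxsize=134217728 \
#      input output

echo "$HADOOP_CLASSPATH"
```

The split size controls how many map tasks each gzipped file is divided into, so it is the main knob for the scaling-out trade-off mentioned above.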

I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:
http://niels.basjes.nl/splittable-gzip

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
CDH4b1).
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
