Hi,

Some time ago I had an idea and implemented it.

Normally a gzipped input file can only be processed by a single mapper, and thus only on a single CPU core, because gzip is not a splittable format. What I created makes it possible to process a gzipped file in such a way that it can run on several mappers in parallel. I've put the javadoc I created on my homepage so you can read more about the details:

http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
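To give a rough impression, enabling such a codec through job configuration would look something like the sketch below. The codec class name and the split-size tuning value are illustrative assumptions here (see the javadoc for the specifics), and the property names assume the Hadoop 2.x MapReduce API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplittableGzipExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Register the codec so it is used instead of the built-in
            // GzipCodec. The class name is illustrative only.
            conf.set("io.compression.codecs",
                     "nl.basjes.hadoop.io.compress.SplittableGzipCodec");

            // Each mapper has to decompress (and discard) everything before
            // its own split, so fewer and larger splits waste less work.
            // The value is a hypothetical tuning choice, not a recommendation.
            conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                         256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "parallel-gzip-example");
            job.setJarByClass(SplittableGzipExample.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper, reducer, and key/value classes as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }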
Now, the question raised by one of the people reviewing this code was: should this implementation be part of the core Hadoop feature set? The main reason given against it is that using it requires some understanding of what is actually happening, and as such it cannot simply be enabled by default.

I would like to hear what the Hadoop Core/MapReduce users think. Should this be:
- part of the default Hadoop feature set, so that anyone can enable it simply by setting the right configuration?
- a separate library?
- a nice idea I had fun building, but that no one needs?
- ... ?

--
Best regards / Met vriendelijke groeten,

Niels Basjes