Let's play devil's advocate for a second.

Why? Snappy already exists.
The only advantage I see is that you don't have to convert from gzip to Snappy
first and can process the gzip files natively.

The next question is: how large are the gzip files in the first place?

I don't disagree; I just want to have a solid argument in favor of it...




Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 28, 2012, at 9:50 AM, Niels Basjes <ni...@basjes.nl> wrote:

> Hi,
> 
> Some time ago I had an idea and implemented it.
> 
> Normally a gzipped input file can only be processed by a single mapper, and
> thus only on a single CPU core.
> What I created makes it possible to process a gzipped file in such a way
> that it can run on several mappers in parallel.
> 
> I've put the javadoc I created on my homepage so you can read more about
> the details.
> http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
> 
> Now, the question raised by one of the people reviewing this code was:
> should this implementation be part of the core Hadoop feature set?
> The main reason given was that using it requires a bit more understanding
> of what is happening, and as such it cannot be enabled by default.
> 
> I would like to hear what you, the Hadoop Core/MapReduce users, think.
> 
> Should this be
> - a part of the default Hadoop feature set so that anyone can simply enable
> it by setting the right configuration?
> - a separate library?
> - a nice idea I had fun building but that no one needs?
> - ... ?
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
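
For reference, the "right configuration" mentioned in the quoted mail would
presumably amount to registering the codec with Hadoop's compression codec
list. A minimal sketch, assuming the codec class is named
nl.basjes.hadoop.io.compress.SplittableGzipCodec (the exact class name is not
stated in this thread and is only inferred from the javadoc link):

    import org.apache.hadoop.conf.Configuration;

    public class SplittableGzipConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Add the (assumed) splittable gzip codec to the list of codecs that
            // Hadoop's CompressionCodecFactory considers for input files,
            // alongside the stock codecs.
            conf.set("io.compression.codecs",
                     "nl.basjes.hadoop.io.compress.SplittableGzipCodec,"
                   + "org.apache.hadoop.io.compress.GzipCodec,"
                   + "org.apache.hadoop.io.compress.DefaultCodec");

            // Print the resulting setting so the sketch can be run standalone.
            System.out.println(conf.get("io.compression.codecs"));
        }
    }

The same property could of course be set once in core-site.xml rather than per
job; either way, treat this as a sketch of the idea, not the documented setup.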
