Re: Should splittable Gzip be a "core" hadoop feature?

Robert Evans Wed, 29 Feb 2012 08:33:09 -0800

I can see a use for it, but I have two concerns about it.  My biggest concern 
is maintainability.  We have had lots of things get thrown into contrib in the 
past, very few people use them, and inevitably they start to suffer from bit 
rot.  I am not saying that it will happen with this, but if you have to ask if 
people will use it and there has been no overwhelming yes, it makes me nervous 
about it.  My second concern is with knowing when to use this.  Anything that 
adds this in would have to come with plenty of documentation about how it 
works, how it is different from the normal gzip format, explanations about what 
kind of load it might put on data nodes that hold the start of the file, etc.


For both of these reasons I would prefer to see this as a GitHub project for a 
while first, and once it shows that it has a significant following, or a 
community around it, then we can pull it in.  But if others disagree I am not 
going to block it.  I am a -0 on pulling this in now.

--Bobby

On 2/29/12 10:00 AM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <edlinuxg...@gmail.com> wrote:
...

> But being able to generate split info for them and processing them
> would be good as well. I remember that was a hot thing to do with lzo
> back in the day. The pain of once overing the gz files to generate the
> split info is detracting but it is nice to know it is there if you
> want it.
>

Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.
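
To make the cost described above concrete, here is a minimal sketch of the
general idea, not the code from HADOOP-7076. It assumes, purely for
illustration, that a split is given as a range of uncompressed byte offsets;
the function name read_split and its parameters are hypothetical. Each task
decompresses from the very start of the stream and throws away everything
before its own range.

```python
import gzip
import io


def read_split(gz_bytes: bytes, start: int, end: int, chunk: int = 64 * 1024) -> bytes:
    """Return uncompressed bytes [start, end) of a gzip stream.

    Assumption: split boundaries are uncompressed offsets (illustrative only).
    Everything before `start` is still decompressed and then discarded,
    which is the extra work a later split has to pay for.
    """
    out = bytearray()
    pos = 0  # number of uncompressed bytes consumed so far
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        while pos < end:
            data = f.read(min(chunk, end - pos))
            if not data:
                break  # stream ended before reaching `end`
            skip = max(start - pos, 0)  # bytes of this chunk that precede the split
            if skip < len(data):
                out += data[skip:]
            pos += len(data)
    return bytes(out)
```

A task for the last split of a file ends up decompressing nearly the whole
file, so the wasted work grows with the position of the split.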

The other two ways of splitting gzipped files either require
- creating some kind of "compression index" before actually using the file
(HADOOP-6153)
- creating a file in a format that is generated in such a way that it is
really a set of concatenated gzipped files. (HADOOP-7909)
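
As a rough illustration of the "concatenated gzipped files" option, and not
the code from HADOOP-7909, the sketch below relies on the fact that a gzip
file may consist of several members written back to back. The helper names
build_concatenated and read_member, and the idea of keeping the member
offsets in a list, are assumptions made for this example only.

```python
import gzip
import zlib


def build_concatenated(chunks):
    """Compress each chunk as its own gzip member and concatenate them.

    Returns the combined bytes and the byte offset at which each member starts.
    """
    blob = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(blob))
        blob += gzip.compress(chunk)
    return bytes(blob), offsets


def read_member(blob: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(wbits=31)
    # Decompression stops at the end of this member; later members
    # are left untouched in d.unused_data.
    return d.decompress(blob[offset:])
```

The result is still a valid gzip file for ordinary tools, but a reader that
knows where a member starts can begin there without touching the data before
it.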

--
Best regards / Met vriendelijke groeten,

Niels Basjes
