Re: Should splittable Gzip be a core hadoop feature?

2012-03-01 Thread Michel Segel
I do agree that a GitHub project is the way to go unless you could convince Cloudera, Hortonworks or MapR to pick it up and support it. They have enough committers. Is this potentially worthwhile? Maybe, it depends on how the cluster is integrated into the overall environment. Companies

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Michel Segel
Let's play devil's advocate for a second? Why? Snappy exists. The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively. Next question is how large are the gzip files in the first place? I don't disagree, I just want to have a solid argument

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Edward Capriolo
Mike, Snappy is cool and all, but I was not overly impressed with it. GZ zips much better than Snappy. Last time I checked, for our log files gzip took them down from 100MB to 40MB, while Snappy compressed them from 100MB to 55MB. That was only with sequence files. But still that is pretty significant
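
To put the figures quoted above side by side, here is a minimal Python sketch of the arithmetic. It uses only the sizes given in the message (100MB original, 40MB with gzip, 55MB with Snappy); the function and variable names are illustrative and not part of any Hadoop API.

```python
def space_saved(original_mb: float, compressed_mb: float) -> float:
    """Return the fraction of the original size removed by compression."""
    return 1.0 - compressed_mb / original_mb


# Sizes as quoted in the message; names here are hypothetical.
ORIGINAL_MB = 100.0
GZIP_MB = 40.0
SNAPPY_MB = 55.0

gzip_saved = space_saved(ORIGINAL_MB, GZIP_MB)
snappy_saved = space_saved(ORIGINAL_MB, SNAPPY_MB)

# How much larger the Snappy output is relative to the gzip output.
snappy_overhead = SNAPPY_MB / GZIP_MB - 1.0

print(f"gzip saves   {gzip_saved:.0%} of the original size")
print(f"snappy saves {snappy_saved:.0%} of the original size")
print(f"snappy output is {snappy_overhead:.1%} larger than gzip output")
```

On these numbers the Snappy files take roughly a third more disk space than the gzip files, which is the gap being called significant here. The figures are one poster's measurement on their own log data, not a general benchmark.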

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi, On Wed, Feb 29, 2012 at 13:10, Michel Segel michael_se...@hotmail.com wrote: Let's play devil's advocate for a second? I always like that :) Why? Because then datafiles from other systems (like the Apache HTTP webserver) can be processed without preprocessing more efficiently. Snappy

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi, On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.com wrote: ... But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once overing the gz files to generate the

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
I can see a use for it, but I have two concerns about it. My biggest concern is maintainability. We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot. I am not saying that it will happen with this, but if

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
If many people are going to use it then by all means put it in. If there is only one person, or a very small handful of people, that are going to use it then I personally would prefer to see it as a separate project. However, Edward, you have convinced me that I am trying to make a logical

Should splittable Gzip be a core hadoop feature?

2012-02-28 Thread Niels Basjes
Hi, Some time ago I had an idea and implemented it. Normally you can only run a single gzipped input file through a single mapper and thus only on a single CPU core. What I created makes it possible to process a Gzipped file in such a way that it can run on several mappers in parallel. I've put