Just curious - the first thing I did when I started using Pig was to test lzo/gzip/bzip2, and lzo, gzip, and even bzip2 at low compression levels all had plenty of CPU to spare. I tested both the native libraries and the Java implementations, and I could not get CPU-bound until I cranked up the compression level on bzip2.
Why is gzip considered too CPU intensive? I tested on my machine and on EC2, I think with the Cloudera EC2 scripts. It seemed the clear winner. I guess this varies a lot based on cluster configuration, workload, use of combine, etc.?

Russell Jurney http://datasyndrome.com

On May 22, 2012, at 8:58 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> But you don't capture the nature of the speed benefit of less data going
> over the wire, right? I mean a lot of people use GZip, but in a Hadoop
> context, it is considered too CPU intensive, and the gain in speed from
> less data going over the wire isn't enough to counteract that... I'm not
> quite sure how to establish that with other methods. I can quantify the
> cpu/size tradeoff with a microbenchmark, but not how it plays out on the
> network.
>
> 2012/5/22 Bill Graham <billgra...@gmail.com>
>
>> You could also try using a microbench framework to test out various
>> compression techniques in isolation.
>>
>> On Tuesday, May 22, 2012, Jonathan Coveney wrote:
>>
>>> Will do, thanks
>>>
>>> 2012/5/22 Alan Gates <ga...@hortonworks.com>
>>>
>>>> You might post this same question to mapred-user@hadoop. I know Owen
>>>> and Arun have done a lot of analysis of these kinds of things when
>>>> optimizing the terasort. Others may have valuable feedback there as
>>>> well.
>>>>
>>>> Alan.
>>>>
>>>> On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
>>>>
>>>>> I've been dealing some with the intermediate serialization in Pig,
>>>>> and will probably be dealing with it more in the future. When
>>>>> serializing, there is generally a time-to-serialize vs.
>>>>> space-on-disk tradeoff (an extreme example being compression vs. no
>>>>> compression, a more nuanced one being varint vs. full int, that
>>>>> sort of thing). With Hadoop, network IO is generally the
>>>>> bottleneck, but I'm not sure of the best way to evaluate something
>>>>> like: method X takes 3x as long to serialize, but is potentially
>>>>> 1/2 as large on disk.
>>>>>
>>>>> What are people doing in the wild?
>>>>> Jon
>>
>> --
>> Sent from Gmail Mobile
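[Editor's note: the microbenchmark Bill suggests and the cpu/size tradeoff Jonathan wants to quantify can be sketched with Python's standard-library codecs. This is only an illustration - the synthetic record data and the stdlib gzip/bz2 implementations are assumptions standing in for real Pig intermediate data and the native Hadoop codecs, whose numbers will differ - but the shape of the tradeoff (bzip2 tighter, gzip cheaper on CPU) shows up the same way.]

```python
import bz2
import gzip
import time

# Synthetic "record-like" data: repetitive enough to compress, like
# typical intermediate tuples. A stand-in, not real Pig output.
data = b"userid=1234\tquery=hadoop compression tradeoff\tclicks=7\n" * 50000

def bench(name, compress):
    """Measure CPU time and compression ratio for one codec."""
    start = time.process_time()
    out = compress(data)
    cpu = time.process_time() - start
    print("%-6s cpu=%.3fs ratio=%.2fx" % (name, cpu, len(data) / len(out)))
    return out

gz = bench("gzip", lambda d: gzip.compress(d, compresslevel=6))
bz = bench("bzip2", lambda d: bz2.compress(d, compresslevel=9))
```

This only captures the in-isolation cost, which is exactly Jonathan's caveat: it says nothing about how the smaller payload plays out on the network, so the shuffle-time side still has to be measured on a real cluster.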
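[Editor's note: Jonathan's "varint vs. full int" example is easy to make concrete. Below is a minimal base-128 varint sketch in the style protobuf uses - not Pig's actual serialization, just an illustration of why a variable-length encoding trades per-value CPU (shifting and branching on every byte) for space on small values.]

```python
def encode_varint(n):
    """Base-128 varint: 7 payload bits per byte, MSB set while more follow."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # more bytes coming
        else:
            out.append(b)         # final byte
            return bytes(out)

def decode_varint(buf):
    """Inverse of encode_varint: accumulate 7 bits per byte until MSB clear."""
    n = shift = 0
    for b in buf:
        n |= (b & 0x7F) << shift
        if not (b & 0x80):
            return n
        shift += 7

# A small value costs 1 byte instead of a fixed 4, at the price of extra
# work on every read and write - the tradeoff in miniature.
assert len(encode_varint(7)) == 1
assert len(encode_varint(300)) == 2
assert decode_varint(encode_varint(300)) == 300
```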