Just curious - the first thing I did when I started using Pig was to
test lzo/gzip/bzip2, and lzo, gzip, and even low-compression bzip2 all
had tons of CPU to spare. I tested both the native
libs and the pure-Java implementations, and I couldn't get CPU-bound until
I cranked the compression level on bzip2.

Why is gzip considered too CPU-intensive? I tested on my machine and
on EC2, I think with the Cloudera EC2 scripts, and it seemed the clear
winner. I guess this varies a lot based on cluster configuration,
workload, use of combiners, etc.?
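
For anyone who wants to reproduce a comparison like this, the knobs
involved look roughly as follows. This is a sketch with MR1 / Pig
0.9-era property names and example codec values, so double-check the
names against your versions:

    -- enable compression of Pig's intermediate (tmp) files
    set pig.tmpfilecompression true;
    set pig.tmpfilecompression.codec gz;  -- 'gz' or 'lzo'
    -- compress map output going into the shuffle
    set mapred.compress.map.output true;
    set mapred.map.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;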

Russell Jurney http://datasyndrome.com

On May 22, 2012, at 8:58 PM, Jonathan Coveney <jcove...@gmail.com> wrote:

> But you don't capture the nature of the speed benefit of less data going
> over the wire, right? I mean, a lot of people use gzip, but in a Hadoop
> context it is considered too CPU-intensive, and the gain in speed from
> less data going over the wire isn't enough to counteract that... I'm not
> quite sure how to establish that with other methods. I can quantify the
> CPU/size tradeoff with a microbenchmark, but not how it plays out on the
> network.
>
> 2012/5/22 Bill Graham <billgra...@gmail.com>
>
>> You could also try using a microbenchmark framework to test out various
>> compression techniques in isolation.
>>
>> On Tuesday, May 22, 2012, Jonathan Coveney wrote:
>>
>>> Will do, thanks
>>>
>>> 2012/5/22 Alan Gates <ga...@hortonworks.com>
>>>
>>>> You might post this same question to mapred-user@hadoop. I know Owen
>>>> and Arun have done a lot of analysis of these kinds of things when
>>>> optimizing the terasort. Others may have valuable feedback there as
>>>> well.
>>>>
>>>> Alan.
>>>>
>>>> On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
>>>>
>>>>> I've been dealing some with the intermediate serialization in Pig, and
>>>>> will probably be dealing with it more in the future. When serializing,
>>>>> there is generally a time-to-serialize vs. space-on-disk tradeoff (an
>>>>> extreme example being compression vs. no compression; a more nuanced
>>>>> one being varint vs. full int, that sort of thing). With Hadoop,
>>>>> network IO is generally the bottleneck, but I'm not sure of the best
>>>>> way to evaluate something like: method X takes 3x as long to
>>>>> serialize, but is potentially 1/2 as large on disk.
>>>>>
>>>>> What are people doing in the wild?
>>>>> Jon
>>>>
>>>>
>>>
>>
>>
>> --
>> Sent from Gmail Mobile
>>
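
To make the CPU-vs-wire tradeoff above concrete, here is a
back-of-envelope model. Every throughput and ratio in it is an
assumption (plug in numbers measured on your own cluster), and it
deliberately ignores the fact that real clusters overlap compression,
spill, and transfer:

    // Crude serial model of shuffle cost: compress everything, then ship it.
    // All throughputs and compression ratios are illustrative assumptions.
    public class ShuffleCostSketch {
        public static void main(String[] args) {
            double rawMB   = 1024.0;  // uncompressed map output (assumed)
            double netMBps = 40.0;    // per-node shuffle bandwidth (assumed)

            // {codec, compress throughput in MB/s, compressed/raw ratio}
            Object[][] codecs = {
                {"none", Double.POSITIVE_INFINITY, 1.00},
                {"lzo",  150.0, 0.50},  // assumed
                {"gzip",  25.0, 0.25},  // assumed
            };
            for (Object[] c : codecs) {
                double cpuSec = rawMB / (Double) c[1];
                double netSec = rawMB * (Double) c[2] / netMBps;
                System.out.printf("%-4s cpu=%5.1fs net=%5.1fs total=%5.1fs%n",
                                  c[0], cpuSec, netSec, cpuSec + netSec);
            }
        }
    }

With these made-up numbers lzo beats both raw and gzip even though gzip
ships the least data; slow the network down or speed gzip up (native
libs) and the ranking flips, which may be why measurements differ so
much between clusters.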
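And on the varint vs. full int example from the original mail, here is
a minimal sketch of the kind of variable-length encoding in question
(the same idea as Hadoop's WritableUtils vints or protobuf varints,
written out by hand for illustration):

    import java.io.ByteArrayOutputStream;

    // Minimal LEB128-style varint: 7 payload bits per byte, high bit set
    // while more bytes follow. Small values shrink to 1 byte instead of 4,
    // at the cost of a shift-and-branch loop per value.
    public class VarIntSketch {
        static void writeVarInt(ByteArrayOutputStream out, int v) {
            while ((v & ~0x7F) != 0) {        // more than 7 significant bits left
                out.write((v & 0x7F) | 0x80); // low 7 bits plus continuation bit
                v >>>= 7;
            }
            out.write(v);                     // last byte, continuation bit clear
        }

        public static void main(String[] args) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int[] samples = {1, 300, 70000, Integer.MAX_VALUE};
            for (int s : samples) writeVarInt(out, s);
            // 1 -> 1 byte, 300 -> 2, 70000 -> 3, MAX_VALUE -> 5: 11 vs. 16 fixed
            System.out.println("varint: " + out.size() + " bytes, fixed: "
                               + samples.length * 4 + " bytes");
        }
    }

That is exactly the shape of the question: fewer bytes on disk and on
the wire, more CPU work per value.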
