But that doesn't capture the speed benefit of less data going over the
wire, right? I mean, a lot of people use gzip, but in a Hadoop context it
is considered too CPU intensive, and the gain in speed from moving less
data over the wire isn't enough to counteract that... I'm not quite sure
how to establish that for other methods. I can quantify the CPU/size
tradeoff with a microbenchmark, but not how it plays out on the network.

2012/5/22 Bill Graham <billgra...@gmail.com>

> You could also try using a microbenchmark framework to test out various
> compression techniques in isolation.
>
> On Tuesday, May 22, 2012, Jonathan Coveney wrote:
>
> > Will do, thanks
> >
> > 2012/5/22 Alan Gates <ga...@hortonworks.com>
> >
> > > You might post this same question to mapred-user@hadoop.  I know Owen
> > > and Arun have done a lot of analysis of these kinds of things when
> > > optimizing the terasort.  Others may have valuable feedback there as
> > > well.
> > >
> > > Alan.
> > >
> > > On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
> > >
> > > > I've been dealing some with the intermediate serialization in Pig,
> > > > and will probably be dealing with it more in the future. When
> > > > serializing, there is generally the time to serialize vs. space on
> > > > disk tradeoff (an extreme example being compression vs. no
> > > > compression, a more nuanced one being varint vs. full int, that sort
> > > > of thing). With Hadoop, generally network IO is the bottleneck, but
> > > > I'm not sure of the best way to evaluate something like: method X
> > > > takes 3x as long to serialize, but is potentially 1/2 as large on
> > > > disk.
> > > >
> > > > What are people doing in the wild?
> > > > Jon
> > >
> > >
> >
>
>
> --
> Sent from Gmail Mobile
>
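
And for the varint vs. full int example in the quoted question, the size
side of that tradeoff is easy to see with a toy base-128 varint. Note this
is only an illustration, not Hadoop's actual VInt encoding (which is
length-prefixed rather than base-128).

import java.io.ByteArrayOutputStream;

// Toy base-128 varint, just to show how much smaller small ints serialize
// compared with fixed 4-byte ints.
public class VarIntSize {
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        // Emit 7 bits per byte, with the high bit set on all but the last.
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream varint = new ByteArrayOutputStream();
        int[] samples = {0, 7, 300, 100000, Integer.MAX_VALUE};
        for (int v : samples) {
            writeVarInt(varint, v);
        }
        // Fixed-width ints always cost 4 bytes each; varints win when small
        // values dominate, which is the common case in practice.
        System.out.printf("varint: %d bytes vs fixed: %d bytes for %d values%n",
                varint.size(), 4 * samples.length, samples.length);
    }
}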
