You could also try using a microbenchmark framework to test various compression techniques in isolation.
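As a rough illustration of that kind of isolated test, here is a minimal Python sketch, not Pig's actual serialization code. It times a fixed-width int encoding against a varint encoding and against zlib-compressed output, and reports the encoded sizes. All function names here are hypothetical, and the cost model (encode time plus size divided by I/O throughput) is a simplifying assumption. A real JVM measurement would need a proper harness with warm-up.

```python
import struct
import time
import zlib


def encode_fixed(values):
    """Encode each value as a 4-byte big-endian unsigned int."""
    return struct.pack(f">{len(values)}I", *values)


def encode_varint(values):
    """Encode each non-negative int as a varint (7 payload bits per byte)."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)
    return bytes(out)


def decode_varint(data):
    """Inverse of encode_varint, used to check round-tripping."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values


def measure(encoder, values, repeats=5):
    """Return (best wall-clock seconds for one call, encoded size in bytes)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encoded = encoder(values)
        best = min(best, time.perf_counter() - start)
    return best, len(encoded)


def breakeven_bytes_per_sec(time_a, size_a, time_b, size_b):
    """I/O throughput below which the slower-but-smaller encoding B wins.

    Assumes total cost = encode_time + size / throughput.
    Returns None unless B is both slower and smaller than A.
    """
    if time_b <= time_a or size_b >= size_a:
        return None
    return (size_a - size_b) / (time_b - time_a)


if __name__ == "__main__":
    values = [i % 1000 for i in range(100_000)]  # synthetic sample data
    encoders = [
        ("fixed", encode_fixed),
        ("varint", encode_varint),
        ("fixed+zlib", lambda v: zlib.compress(encode_fixed(v))),
    ]
    for name, encoder in encoders:
        secs, size = measure(encoder, values)
        print(f"{name:>10}: {secs * 1e3:8.2f} ms  {size:>8} bytes")
```

Under that cost model, and with made-up numbers matching the "3x as long, 1/2 as large" case: if the baseline takes 1 s to write 100 MB and method X takes 3 s to write 50 MB, X only pays off when effective throughput is below (100 - 50) / (3 - 1) = 25 MB/s.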

On Tuesday, May 22, 2012, Jonathan Coveney wrote:

> Will do, thanks
>
> 2012/5/22 Alan Gates <ga...@hortonworks.com>
>
> > You might post this same question to mapred-user@hadoop. I know Owen and
> > Arun have done a lot of analysis of these kinds of things when optimizing
> > the terasort. Others may have valuable feedback there as well.
> >
> > Alan.
> >
> > On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
> >
> > > I've been dealing some with the intermediate serialization in Pig, and
> > > will probably be dealing with it more in the future. When serializing,
> > > there is generally the time to serialize vs. space on disk tradeoff (an
> > > extreme example being compression vs. no compression, a more nuanced
> > > one being varint vs full int, that sort of thing). With Hadoop,
> > > generally network io is the bottleneck, but I'm not sure of the best
> > > way to evaluate something like: method X takes 3x as long to serialize,
> > > but is potentially 1/2 as large on disk.
> > >
> > > What are people doing in the wild?
> > > Jon

--
Sent from Gmail Mobile