I've been doing some work with Pig's intermediate serialization, and will
probably be doing more in the future. When serializing, there is generally
a tradeoff between serialization time and space on disk (an extreme
example being compression vs. no compression, a more nuanced one being
varint vs. full int, that sort of thing). With Hadoop, network I/O is
generally the bottleneck, but I'm not sure of the best way to evaluate
something like: method X takes 3x as long to serialize, but is potentially
half as large on disk.
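
To make the varint-vs-int case concrete, here's a quick sketch of the size
side of the tradeoff using Hadoop's own WritableUtils (just an
illustration, not Pig's actual intermediate format):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableUtils;

    public class VIntVsInt {
        public static void main(String[] args) throws IOException {
            int[] samples = {7, 1000, 5000000};
            for (int v : samples) {
                // Fixed-width encoding: always 4 bytes, cheap to write.
                ByteArrayOutputStream fixed = new ByteArrayOutputStream();
                new DataOutputStream(fixed).writeInt(v);

                // Variable-length encoding: 1-5 bytes depending on magnitude,
                // so small values get much smaller at some extra CPU cost.
                ByteArrayOutputStream var = new ByteArrayOutputStream();
                WritableUtils.writeVInt(new DataOutputStream(var), v);

                System.out.printf("%,d -> fixed: %d bytes, vint: %d bytes%n",
                        v, fixed.size(), var.size());
            }
        }
    }

Small values serialize to a single byte as a vint, while values near
Integer.MAX_VALUE actually take 5 bytes, so the payoff depends entirely on
the distribution of your data, which is part of why I find the tradeoff
hard to evaluate in general.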

What are people doing in the wild?
Jon
