YSmart does some clever correlation-based optimizations to achieve significant speedups. They have a good paper about it: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf Note the Hive/Pig numbers: it seems we are generating unnecessary jobs and too much intermediate data (not sure which version of Pig they ran).