The correlation-based optimization in YSmart looks good: it creates a minimal number of jobs by exploiting correlations between the multiple jobs. The experiments section mentions that they used the CDH distribution for their experimental setup. Since the paper was published at ICDCS 2011 in June, a quick glance at the release history of CDH3 beta 4 (released in Feb 2011) shows Pig 0.8.0, so that is likely the version they ran. It looks like they have patched this in Hive: http://code.google.com/p/ysmart/wiki/HivePatch
On Mar 10, 2012, at 11:16 PM, Dmitriy Ryaboy wrote:

> Yslow does some clever correlation-based optimizations to achieve
> significant speedups. They have a good paper about it:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Note the Hive/Pig numbers.. we are generating unnecessary jobs, and
> too much intermediate data, it seems (not sure which version of Pig
> they ran).
>
> D