One of the changes we want to make in the new Pig pipeline is to make
much more aggressive use of the combiner. In thinking through when we
should use the combiner, I came up with the following list. It is not
exhaustive, but it covers common expressions and should be
implementable within a week or so. If you can think of other rules
that fit this profile, please suggest them.
1) Filters.
a) If the predicate does not operate on any of the bags (that is, it
only operates on the grouping key), then the filter will be relocated
to the combiner phase. For example:
b = group a by $0;
c = filter b by group != 'fred';
...
In this case, operations subsequent to the filter could also be
considered for pushing into the combiner.
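To see why this is safe: each partial group the combiner builds for a
key carries the same key value as the final reduce-side group, so a
key-only predicate gives the same answer in either phase. A rough
Java sketch of the idea (the class and names are illustrative, not
Pig internals):

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KeyOnlyFilterSketch {
    // A partial group as the combiner sees it: the grouping key plus
    // whichever tuples landed in this map task's output.
    record PartialGroup(String key, List<String> tuples) {}

    // The key-only predicate from the example: group != 'fred'.
    static boolean keep(String key) {
        return !key.equals("fred");
    }

    public static void main(String[] args) {
        List<PartialGroup> partials = List.of(
            new PartialGroup("fred", List.of("t1", "t2")),
            new PartialGroup("jane", List.of("t3")),
            new PartialGroup("fred", List.of("t4")),
            new PartialGroup("jane", List.of("t5")));

        // Combiner phase: filter partial groups by key. Dropping a
        // partial group here is equivalent to dropping the whole
        // group in the reducer, since the key is already final.
        List<PartialGroup> kept = partials.stream()
            .filter(p -> keep(p.key()))
            .collect(Collectors.toList());

        // Reduce phase: merge surviving partials per key; no second
        // filter is needed, 'fred' never crosses the shuffle.
        Map<String, List<String>> groups = kept.stream()
            .collect(Collectors.groupingBy(PartialGroup::key,
                Collectors.flatMapping(p -> p.tuples().stream(),
                    Collectors.toList())));

        System.out.println(groups); // {jane=[t3, t5]}
    }
}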
b) If the predicate operates on the bags only through an algebraic
function, then a foreach with the initial function will be placed in
the combiner phase, and the filter in the reduce phase will be
changed to use the final function. For example:
b = group a by $0;
c = filter b by COUNT(a) > 0;
...
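COUNT is algebraic, so its initial function can count each partial
bag in the combiner, and its final function only has to sum those
partial counts in the reducer, where the predicate is then applied. A
rough sketch of the split (method names are illustrative, not Pig's
actual UDF API):

import java.util.List;

public class AlgebraicCountSketch {
    // Initial function, run in the combiner: count one partial bag
    // and emit the partial result.
    static long countInitial(List<String> partialBag) {
        return partialBag.size();
    }

    // Final function, run in the reducer: combine the partial counts
    // that arrived for the key.
    static long countFinal(List<Long> partialCounts) {
        return partialCounts.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Two map tasks produced two partial bags for the same key.
        long p1 = countInitial(List.of("t1", "t2"));
        long p2 = countInitial(List.of("t3"));

        // The filter's predicate runs against the final value only.
        long total = countFinal(List.of(p1, p2));
        System.out.println("keep group? " + (total > 0)); // true
    }
}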
2) Foreach.
a) If the foreach does not contain a nested plan and all UDFs in the
generate statement are algebraic, then the foreach will be copied
and placed in the combiner phase. The copy in the combiner phase
will use the initial function, and the version in the reduce phase
will be changed to use the final function. For example:
b = group a by $0;
c = foreach b generate group, group + 5, SUM(a.$1);
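Here SUM would be split the same way: the combiner's copy of the
foreach emits a partial sum per partial group, and the reduce-side
foreach applies the final function to the bag of partial sums.
Key-only expressions such as group + 5 need no splitting at all. A
rough sketch (again, illustrative names):

import java.util.List;

public class AlgebraicSumSketch {
    // Combiner-side foreach: SUM's initial function over the partial
    // bag of a.$1 values, one partial sum per partial group.
    static long sumInitial(List<Long> partialValues) {
        return partialValues.stream().mapToLong(Long::longValue).sum();
    }

    // Reducer-side foreach: SUM's final function over the bag of
    // partial sums that arrived for the key.
    static long sumFinal(List<Long> partialSums) {
        return partialSums.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        long group = 7; // the grouping key from $0

        // Partial bags of a.$1 from two map tasks for this key.
        long p1 = sumInitial(List.of(3L, 4L));
        long p2 = sumInitial(List.of(5L));

        // generate group, group + 5, SUM(a.$1) in the reduce phase;
        // group + 5 is computed from the key alone.
        System.out.println(group + ", " + (group + 5) + ", "
            + sumFinal(List.of(p1, p2))); // 7, 12, 12
    }
}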
b) If the foreach has an inner plan that contains a distinct and no
filters, then the foreach will be left as is in the reduce plan, and
a combiner plan will be created that runs the inner plan minus the
generate. This produces the distinct portion of the data without
applying the UDF. For example:
b = group a by $0;
c = foreach b {
    c1 = distinct $1;
    generate group, COUNT(c1);
}
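A rough sketch of that split: the combiner runs just the distinct
(the inner plan minus the generate) over each partial bag; the
reducer, which sees all partials for a key, runs the distinct again
and only then applies COUNT. (Illustrative Java, not Pig internals.)

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DistinctInnerPlanSketch {
    // Combiner: the inner plan minus the generate, i.e. just the
    // distinct, applied to one partial bag. COUNT cannot run here
    // because a value may still appear in other partial bags.
    static Set<String> combineDistinct(List<String> partialBag) {
        return new LinkedHashSet<>(partialBag);
    }

    // Reducer: distinct over the merged partials, then the generate
    // applies the UDF to the fully deduplicated bag.
    static long reduceCount(List<Set<String>> partials) {
        Set<String> merged = new LinkedHashSet<>();
        partials.forEach(merged::addAll);
        return merged.size();
    }

    public static void main(String[] args) {
        // Two partial bags for one key; "x" appears in both, so the
        // combiner alone cannot produce the right count.
        Set<String> p1 = combineDistinct(List.of("x", "y", "x"));
        Set<String> p2 = combineDistinct(List.of("x", "z"));
        System.out.println(reduceCount(List.of(p1, p2))); // 3
    }
}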
3) Distinct. This will be converted to apply the distinct in the
combiner as well as in the reducer.
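This works because distinct is idempotent: deduplicating map-side
partials and then deduplicating again over the shuffled result gives
the same answer as a single reduce-side distinct, while cutting
shuffle traffic. A rough sketch:

import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DistinctSketch {
    // distinct(distinct(p1) union distinct(p2)) equals
    // distinct(p1 union p2), so applying it twice is safe.
    static Set<String> distinct(List<String> tuples) {
        return new LinkedHashSet<>(tuples);
    }

    public static void main(String[] args) {
        // Combiner pass over each map task's partial output...
        Set<String> c1 = distinct(List.of("a", "b", "a"));
        Set<String> c2 = distinct(List.of("b", "c"));

        // ...then again in the reducer over everything that shuffled.
        List<String> shuffled = new ArrayList<>(c1);
        shuffled.addAll(c2);
        System.out.println(distinct(shuffled)); // [a, b, c]
    }
}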
4) Limit. The limit will be applied in the combiner and again in the
reducer.
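This is safe because limit only promises some n tuples, not which
ones, so the combiner can discard extras early; the reducer trims the
concatenated partials to n again for the final answer. A rough
sketch:

import java.util.ArrayList;
import java.util.List;

public class LimitSketch {
    // Keep at most n tuples from the input.
    static <T> List<T> limit(List<T> tuples, int n) {
        return new ArrayList<>(tuples.subList(0, Math.min(n, tuples.size())));
    }

    public static void main(String[] args) {
        int n = 3;

        // Each combiner trims its partial output to n, bounding what
        // crosses the shuffle...
        List<Integer> c1 = limit(List.of(1, 2, 3, 4, 5), n);
        List<Integer> c2 = limit(List.of(6, 7), n);

        // ...and the reducer trims the merged partials to n again.
        List<Integer> all = new ArrayList<>(c1);
        all.addAll(c2);
        System.out.println(limit(all, n)); // [1, 2, 3]
    }
}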
Alan.