1) I think this is a job for the query optimizer (filter pushdown). In
effect, the rewrite would be:
b = group a by $0;
c = filter b by group != 'fred';
becomes
before_b = filter a by $0 != 'fred';
b = group before_b by $0;
c = b;
so the filter can be done in the map stage, before the combiner.
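
To make the legality check concrete, here is a small, self-contained Java
sketch. The GroupOp/FilterOp classes are invented purely for illustration
(this is not Pig's actual plan API); the point is just the condition: the
filter can be moved in front of the group, and hence run map-side before the
combiner, exactly when every column its predicate reads belongs to the
grouping key.

import java.util.Collections;
import java.util.Set;

// Toy plan nodes, invented for illustration only (not Pig's plan API).
class GroupOp {
    final Set<Integer> groupKeyColumns;      // e.g. {0} for "group a by $0"
    GroupOp(Set<Integer> keys) { groupKeyColumns = keys; }
}

class FilterOp {
    final Set<Integer> referencedColumns;    // columns the predicate reads
    FilterOp(Set<Integer> cols) { referencedColumns = cols; }
}

public class FilterPushdown {
    // The filter may move below the group (and so run map-side, before the
    // combiner) only if its predicate touches nothing but the grouping key;
    // predicates over the grouped bags have to stay in the reduce stage.
    static boolean canPushBelowGroup(FilterOp filter, GroupOp group) {
        return group.groupKeyColumns.containsAll(filter.referencedColumns);
    }

    public static void main(String[] args) {
        GroupOp b = new GroupOp(Collections.singleton(0));    // b = group a by $0;
        FilterOp c = new FilterOp(Collections.singleton(0));  // c = filter b by group != 'fred';
        System.out.println(canPushBelowGroup(c, b));          // true -> rewrite as above
    }
}
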
2a) According to the new combiner behavior discussed in PIG-274, it seems we
will have to (a rough sketch of such a UDF follows the list):
- call the initial function at the end of the map stage,
- call the intermediate function in the combiner (because the combiner is
optional, this step may run zero or more times),
- call the final function in the reduce stage.
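
For concreteness, here is a rough Java sketch of how a three-phase
(algebraic) COUNT-style UDF could be wired up. Algebraic, EvalFunc, Tuple,
DataBag and TupleFactory are Pig classes, but the exact per-phase calling
convention assumed below (each phase gets a bag in field 0 of its input
tuple, and the Initial/Intermed phases hand partial counts forward as
single-field tuples) is an assumption until PIG-274 settles, not the final
API.

import java.io.IOException;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SketchCount extends EvalFunc<Long> implements Algebraic {
    private static final TupleFactory tf = TupleFactory.getInstance();

    // Non-combined path: count the whole bag in one pass on the reducer.
    public Long exec(Tuple input) throws IOException {
        return count(bag(input));
    }

    // Phase wiring: map -> Initial, combiner -> Intermed, reduce -> Final.
    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    // End of the map stage: emit a partial count for this map's slice.
    public static class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(count(bag(input)));
        }
    }

    // Combiner (may run zero or more times): fold partial counts together.
    public static class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(sumPartials(bag(input)));
        }
    }

    // Reduce stage: produce the final count from whatever partials arrive.
    public static class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {
            return sumPartials(bag(input));
        }
    }

    private static DataBag bag(Tuple input) throws IOException {
        try {
            return (DataBag) input.get(0);
        } catch (Exception e) {
            throw new IOException("expected a bag in field 0: " + e.getMessage());
        }
    }

    private static long count(DataBag b) {
        long n = 0;
        for (Tuple ignored : b) {
            n++;
        }
        return n;
    }

    private static long sumPartials(DataBag b) throws IOException {
        long n = 0;
        for (Tuple t : b) {
            try {
                n += (Long) t.get(0);
            } catch (Exception e) {
                throw new IOException("bad partial count: " + e.getMessage());
            }
        }
        return n;
    }
}

The point is only the split: Initial produces partials map-side, Intermed
folds partials in the optional combiner, and Final produces the answer in the
reducer, so the result is correct whether the combiner runs zero, one, or
several times.
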
Pi
On Wed, Jul 16, 2008 at 10:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> One of the changes we want to make in the new Pig pipeline is to make much
> more aggressive use of the combiner. In thinking through when we should use
> the combiner, I came up with the following. The list is not exhaustive, but
> it includes common expressions and should be possible to implement within a
> week or so. If you can think of other rules that fit this profile, please
> suggest them.
>
> 1) Filters.
> a) If the predicate does not operate on any of the bags (that is, it
> only operates on the grouping key) then the filter will be relocated
> to the combiner phase. For example:
>
> b = group a by $0;
> c = filter b by group != 'fred';
> ...
>
> In this case, operations subsequent to the filter could also be
> considered for pushing into the combiner.
>
> b) If it operates on the bags with an algebraic function, then a
> foreach with the initial function will be placed in the combiner
> phase and the filter in the reduce phase will be changed to use the
> final function. For example:
>
> b = group a by $0;
> c = filter b by COUNT(a) > 0;
> ...
>
> 2) Foreach.
> a) If the foreach does not contain a nested plan and all UDFs in the
> generate statement are algebraic, then the foreach will be copied
> and placed in the combiner phase. The version of the foreach in the
> combiner stage will use the initial function, and the version in the
> reduce stage will be changed to use the final function. For
> example:
>
> b = group a by $0;
> c = foreach b generate group, group + 5, SUM(a.$1);
>
> b) If the foreach has an inner plan that has a distinct and no
> filters, then it will be left as is in the reduce plan and a
> combiner plan will be created that runs the inner plan minus the
> generate on the tuples, thus creating the distinct portion of the
> data without applying the UDF. For example:
>
> b = group a by $0;
> c = foreach b {
>     c1 = distinct $1;
>     generate group, COUNT(c1);
> }
>
> 3) Distinct. This will be converted to apply the distinct in the combiner
> as well as in the reducer.
>
> 4) Limit. The limit will be applied in the combiner and again in the
> reducer.
>
> Alan.
>