1) I think this is a job for the query optimizer (filter pushdown). In
effect, the rewrite would be:
b = group a by $0;
c = filter b by group != 'fred';
becomes
before_b = filter a by $0 != 'fred';
b = group before_b by $0;
c = b;
so the filter can be done in the map stage, before the combiner.
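
To make the legality check concrete, here is a small, self-contained Java
sketch. The GroupOp/FilterOp classes are invented purely for illustration
(this is not Pig's actual plan API); the point is just the condition: the
filter can be moved in front of the group, and hence run map-side before the
combiner, exactly when every column its predicate reads belongs to the
grouping key.

import java.util.Collections;
import java.util.Set;

// Toy plan nodes, invented for illustration only (not Pig's plan API).
class GroupOp {
    final Set<Integer> groupKeyColumns;      // e.g. {0} for "group a by $0"
    GroupOp(Set<Integer> keys) { groupKeyColumns = keys; }
}

class FilterOp {
    final Set<Integer> referencedColumns;    // columns the predicate reads
    FilterOp(Set<Integer> cols) { referencedColumns = cols; }
}

public class FilterPushdown {
    // The filter may move below the group (and so run map-side, before the
    // combiner) only if its predicate touches nothing but the grouping key;
    // predicates over the grouped bags have to stay in the reduce stage.
    static boolean canPushBelowGroup(FilterOp filter, GroupOp group) {
        return group.groupKeyColumns.containsAll(filter.referencedColumns);
    }

    public static void main(String[] args) {
        GroupOp b = new GroupOp(Collections.singleton(0));    // b = group a by $0;
        FilterOp c = new FilterOp(Collections.singleton(0));  // c = filter b by group != 'fred';
        System.out.println(canPushBelowGroup(c, b));          // true -> rewrite as above
    }
}
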
2a) According to the new combiner behavior discussed in PIG-274, it seems we
will have to (a rough sketch of such a UDF follows the list):
- call the initial function at the end of the map stage,
- call the intermediate function in the combiner (because the combiner is
optional, this step may run zero or more times),
- call the final function in the reduce stage.
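
For concreteness, here is a rough Java sketch of how a three-phase
(algebraic) COUNT-style UDF could be wired up. Algebraic, EvalFunc, Tuple,
DataBag and TupleFactory are Pig classes, but the exact per-phase calling
convention assumed below (each phase gets a bag in field 0 of its input
tuple, and the Initial/Intermed phases hand partial counts forward as
single-field tuples) is an assumption until PIG-274 settles, not the final
API.

import java.io.IOException;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SketchCount extends EvalFunc<Long> implements Algebraic {
    private static final TupleFactory tf = TupleFactory.getInstance();

    // Non-combined path: count the whole bag in one pass on the reducer.
    public Long exec(Tuple input) throws IOException {
        return count(bag(input));
    }

    // Phase wiring: map -> Initial, combiner -> Intermed, reduce -> Final.
    public String getInitial()  { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal()    { return Final.class.getName(); }

    // End of the map stage: emit a partial count for this map's slice.
    public static class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(count(bag(input)));
        }
    }

    // Combiner (may run zero or more times): fold partial counts together.
    public static class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            return tf.newTuple(sumPartials(bag(input)));
        }
    }

    // Reduce stage: produce the final count from whatever partials arrive.
    public static class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {
            return sumPartials(bag(input));
        }
    }

    private static DataBag bag(Tuple input) throws IOException {
        try {
            return (DataBag) input.get(0);
        } catch (Exception e) {
            throw new IOException("expected a bag in field 0: " + e.getMessage());
        }
    }

    private static long count(DataBag b) {
        long n = 0;
        for (Tuple ignored : b) {
            n++;
        }
        return n;
    }

    private static long sumPartials(DataBag b) throws IOException {
        long n = 0;
        for (Tuple t : b) {
            try {
                n += (Long) t.get(0);
            } catch (Exception e) {
                throw new IOException("bad partial count: " + e.getMessage());
            }
        }
        return n;
    }
}

The point is only the split: Initial produces partials map-side, Intermed
folds partials in the optional combiner, and Final produces the answer in the
reducer, so the result is correct whether the combiner runs zero, one, or
several times.
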
Pi
On Wed, Jul 16, 2008 at 10:41 AM, Alan Gates <[EMAIL PROTECTED]> wrote:
> One of the changes we want to make in the new Pig pipeline is to make much
> more aggressive use of the combiner. In thinking through when we should use
> the combiner, I came up with the following. The list is not exhaustive, but
> it includes common expressions and should be possible to implement within a
> week or so. If you can think of other rules that fit this profile, please
> suggest them.
>
> 1) Filters.
> a) If the predicate does not operate on any of the bags (that is, it
> only operates on the grouping key) then the filter will be relocated
> to the combiner phase. For example:
>
> b = group a by $0;
> c = filter b by group != 'fred';
> ...
>
> In this case, operations subsequent to the filter could also be
> considered for pushing into the combiner.
>
> b) If it operates on the bags with an algebraic function, then a
> foreach with the initial function will be placed in the combiner
> phase and the filter in the reduce phase will be changed to use the
> final function. For example:
>
> b = group a by $0;
> c = filter b by COUNT(a) > 0;
> ...
>
> 2) Foreach.
> a) If the foreach does not contain a nested plan and all UDFs in the
> generate statement are algebraic, then the foreach will be copied
> and placed in the combiner phase. The version of the foreach in the
> combiner stage will use the initial function, and the version in the
> reduce stage will be changed to use the final function. For
> example:
>
> b = group a by $0;
> c = foreach b generate group, group + 5, SUM(a.$1);
>
> b) If the foreach has an inner plan that has a distinct and no
> filters, then it will be left as is in the reduce plan and a
> combiner plan will be created that runs the inner plan minus the
> generate on the tuples, thus creating the distinct portion of the
> data without applying the UDF. For example:
>
> b = group a by $0;
> c = foreach b {
>     c1 = distinct $1;
>     generate group, COUNT(c1);
> }
>
> 3) Distinct. This will be converted to apply the distinct in the combiner
> as well as in the reducer.
>
> 4) Limit. The limit will be applied in the combiner and again in the
> reducer.
>
> Alan.
>