[ 
https://issues.apache.org/jira/browse/PIG-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1790:
--------------------------------

    Assignee: Corinne Chandel  (was: Olga Natkovich)

The following information should be added to the cookbook:

Make Sure Combiner is Used
 
Whenever possible, make sure that the combiner is used, as it frequently yields
an order-of-magnitude improvement in performance. The combiner is generally
used in the case of a non-nested foreach where all projections are either
expressions on the group column or expressions on algebraic UDFs [LINK to
definition].
 
Example:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)),
    COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2,
    group.age;
Explain C;
 
There are a number of things to note in this example:
 

- Group can be referred to as a whole or by accessing individual fields, as is
  the case in this example.
- Group and its elements can appear anywhere in the projection.
- A variety of expressions can be applied to algebraic functions, including:
    - A column transformation function such as ABS applied to an algebraic
      function (SUM).
    - An algebraic function (COUNT) applied to another algebraic function
      (Distinct), although only the inner function is computed using the
      combiner.
    - A mathematical expression applied to one or more algebraic functions.


You can check whether the combiner is used for your query by running explain on
the foreach alias, as shown above. You should see a Combine Plan section in the
MapReduce part of the plan:
 
.....
Combine Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
|   |
|   Project[bytearray][0] - scope-43
|
|---C: New For Each(false,false,false)[bag] - scope-28
    |   |
    |   Project[bytearray][0] - scope-29
    |   |
    |  POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
    |   |
    |   |---Project[bag][1] - scope-31
    |   |
    |  POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
    |   |
    |   |---Project[bag][2] - scope-33
    |
    |---POCombinerPackage[tuple]{bytearray} - scope-36--------
.....
 
The combiner is also used with a nested foreach, as long as the only nested
operation used is DISTINCT:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B {
    D = distinct (A.name);
    generate group, COUNT(D);
}
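
For contrast, here is a sketch of a nested foreach that falls outside the rule
above, because its nested operation is a FILTER rather than a DISTINCT. The
3.0 threshold is only an illustrative value. Going by the rule as stated, the
combiner would not be expected here, and running explain on the alias is the
way to confirm what your Pig version actually does:

A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
-- nested operation is a filter, not a distinct
C = foreach B {
    D = filter A by gpa > 3.0;
    generate group, COUNT(D);
}
-- look for a Combine Plan section in the output
explain C;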
 
Finally, use of the combiner is influenced by the surrounding environment of
the GROUP/FOREACH statements.
 
The combiner is generally not used if any operator comes between the GROUP and
the FOREACH in the execution plan. Even if the two statements are next to each
other in your script, the optimizer might rearrange them, as is the case with
the example below:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = filter C by group.age <30;
 
In this case, the filter will be pushed above the foreach, which prevents the
use of the combiner. Note that this script can be made more efficient by
performing the filtering before the group:
 
A = load 'studenttab10k' as (name, age, gpa);
B = filter A by age <30;
C = group B by age;
D = foreach C generate group, COUNT (B);
 
One exception to the above rule is limit. Starting with Pig 0.9, even if
limit comes between GROUP and FOREACH, the combiner will still be used:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = limit C 20;
 
In this example, the optimizer will push the limit above the foreach, which
does not disable the combiner.
 
The combiner is also not used when multiple foreach statements are associated
with the same group:
 
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
.....
 
Depending on your use case, it might be more efficient to split your script
into multiple scripts.
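
As a rough sketch of such a split, each script below has a single foreach
associated with its group. The store statements and the output paths
('count_by_age', 'gpa_range_by_age') are hypothetical and are added only so
that each script is complete on its own. Whether the split pays off depends on
your data, since the input is now loaded and grouped twice.

-- script 1 (output path is hypothetical)
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
store C into 'count_by_age';

-- script 2 (output path is hypothetical)
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
store D into 'gpa_range_by_age';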


> Need to document when combiner is used
> --------------------------------------
>
>                 Key: PIG-1790
>                 URL: https://issues.apache.org/jira/browse/PIG-1790
>             Project: Pig
>          Issue Type: Improvement
>          Components: documentation
>            Reporter: Olga Natkovich
>            Assignee: Corinne Chandel
>             Fix For: 0.9.0
>
>
> I searched through the documentation but could not find a section that 
> describes the cases under which the combiner is used. Since the combiner has
> such a significant impact on query performance, I think it is important to
> add this information. Also, with 0.9 we are expanding combiner usage, so
> having documentation would be useful for that as well.
> Here are the JIRAs for combiner use slated for 0.9:
> https://issues.apache.org/jira/browse/PIG-750
> https://issues.apache.org/jira/browse/PIG-490
> https://issues.apache.org/jira/browse/PIG-946
> https://issues.apache.org/jira/browse/PIG-1735

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
