[ https://issues.apache.org/jira/browse/PIG-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-1790:
--------------------------------
Assignee: Corinne Chandel (was: Olga Natkovich)
The following information should be added to the cookbook:
Make Sure Combiner is Used
Whenever possible, make sure that the combiner is used, as it frequently yields
an order-of-magnitude improvement in performance. The combiner is generally
used in the case of a non-nested foreach where all projections are either
expressions on the group column or expressions on algebraic UDFs [LINK to
definition].
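To illustrate why algebraic functions make the combiner possible, here is a conceptual sketch in Python. It is not Pig code and not Pig's actual implementation. It assumes an algebraic function can be split into an initial, an intermediate, and a final stage (the intermediate stage corresponds to names such as SUM$Intermediate in the explain output). All function names and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical three-stage COUNT, modeled on the idea of an algebraic function.
def count_initial(record):
    # Map side: each input record contributes a partial count of 1.
    return 1

def count_intermediate(partials):
    # Combiner: merge the partial counts from one map task into one partial.
    return sum(partials)

def count_final(partials):
    # Reduce side: merge whatever partials arrive into the final count.
    return sum(partials)

def run(map_splits, use_combiner):
    """Count (age, name) records per age; return (result, records shuffled)."""
    shuffled = defaultdict(list)  # partials that reach the reducer, per key
    for split in map_splits:
        local = defaultdict(list)
        for age, name in split:
            local[age].append(count_initial((age, name)))
        for age, partials in local.items():
            if use_combiner:
                shuffled[age].append(count_intermediate(partials))
            else:
                shuffled[age].extend(partials)
    result = {age: count_final(p) for age, p in shuffled.items()}
    records_shuffled = sum(len(p) for p in shuffled.values())
    return result, records_shuffled

# Made-up sample data: two map splits of (age, name) records.
splits = [
    [(18, "ann"), (18, "bob"), (19, "cat")],
    [(18, "dan"), (19, "eve"), (19, "fay")],
]
with_combiner, shuffled_with = run(splits, use_combiner=True)
without_combiner, shuffled_without = run(splits, use_combiner=False)
```

In this toy run both variants produce the same counts, but the combiner variant sends fewer partial records to the reducer, which is where the performance gain comes from.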
Example:
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate ABS(SUM(A.gpa)),
COUNT(org.apache.pig.builtin.Distinct(A.name)), (MIN(A.gpa) + MAX(A.gpa))/2,
group.age;
Explain C;
There are a number of things to note in this example:
* The group can be referred to as a whole or by accessing individual fields,
  as is the case in this example.
* The group and its elements can appear anywhere in the projection.
* A variety of expressions can be applied to algebraic functions, including:
  * A column transformation function, such as ABS, applied to an algebraic
    function (SUM).
  * An algebraic function (COUNT) applied to another algebraic function
    (Distinct), although only the inner function is computed using the
    combiner.
  * A mathematical expression applied to one or more algebraic functions.
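The reason outer expressions such as ABS(...) or (MIN + MAX)/2 need not block the combiner can be sketched in Python: only the inner aggregates require partial results, and the surrounding expression is applied once on the merged values. This is a conceptual model with made-up data and hypothetical helper names, not a description of Pig's internals.

```python
# Made-up gpa values for one group, as seen by two separate map tasks.
map_chunks = [[3.5, 2.0, 4.0], [1.5, 3.0]]

def partial_aggregates(values):
    # Combiner-side step: collapse a chunk to one (sum, min, max) triple.
    return (sum(values), min(values), max(values))

def merge(partials):
    # Reduce-side step: merge the triples from all chunks.
    sums, mins, maxs = zip(*partials)
    return (sum(sums), min(mins), max(maxs))

total, lowest, highest = merge([partial_aggregates(c) for c in map_chunks])

# The outer expressions run once, on the merged aggregates.
abs_sum = abs(total)               # models ABS(SUM(A.gpa))
midpoint = (lowest + highest) / 2  # models (MIN(A.gpa) + MAX(A.gpa))/2

# Reference values computed directly, without any partial aggregation.
all_values = [v for chunk in map_chunks for v in chunk]
```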
You can check whether the combiner is used for your query by running explain on
the foreach alias, as shown above. You should see the combine section in the
MapReduce part of the plan:
.....
Combine Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
| |
| Project[bytearray][0] - scope-43
|
|---C: New For Each(false,false,false)[bag] - scope-28
| |
| Project[bytearray][0] - scope-29
| |
| POUserFunc(org.apache.pig.builtin.SUM$Intermediate)[tuple] - scope-30
| |
| |---Project[bag][1] - scope-31
| |
| POUserFunc(org.apache.pig.builtin.Distinct$Intermediate)[tuple] - scope-32
| |
| |---Project[bag][2] - scope-33
|
|---POCombinerPackage[tuple]{bytearray} - scope-36--------
.....
The combiner is also used with a nested foreach, as long as the only nested
operation used is DISTINCT:
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B {
    D = distinct (A.name);
    generate group, COUNT(D);
}
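Why DISTINCT can still benefit from a combiner can be sketched in Python: duplicates can be dropped within each map task's output before the shuffle, and the reducer then removes the remaining duplicates across tasks and counts. This is a conceptual model with made-up data and hypothetical function names, not a description of how Pig implements it.

```python
# Made-up names for one age group, as seen by two separate map tasks.
map_chunks = [["ann", "bob", "ann"], ["bob", "cat", "cat"]]

def combine_distinct(names):
    # Combiner-side step: remove duplicates within a single chunk.
    return set(names)

def reduce_count_distinct(partial_sets):
    # Reduce-side step: union the partial sets, then count (models COUNT(D)).
    return len(set().union(*partial_sets))

partials = [combine_distinct(chunk) for chunk in map_chunks]
distinct_count = reduce_count_distinct(partials)

names_before = sum(len(chunk) for chunk in map_chunks)  # records without combining
names_after = sum(len(p) for p in partials)             # records after combining
```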
Finally, use of the combiner is influenced by the surrounding environment of
the GROUP/FOREACH statements.
The combiner is generally not used if any operator comes between the GROUP and
the FOREACH in the execution plan. Even if they are next to each other in your
script, the optimizer might rearrange them, as is the case with the example
below:
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = filter C by group.age <30;
In this case, the filter will be pushed above the foreach, which will prevent
the use of the combiner. Note that this script can be made more efficient by
performing the filtering before the group:
A = load 'studenttab10k' as (name, age, gpa);
B = filter A by age <30;
C = group B by age;
D = foreach C generate group, COUNT (B);
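The equivalence of the two scripts can be checked with a small Python model. Because the filter condition depends only on the group key, filtering before or after grouping yields the same counts, but filtering first sends fewer records through the group. The data and variable names below are made up for illustration.

```python
from collections import Counter

# Made-up (name, age) records standing in for studenttab10k.
records = [("ann", 18), ("bob", 35), ("cat", 18), ("dan", 29), ("eve", 42)]

# Group first, then filter the groups (models the first script).
counts_all = Counter(age for _name, age in records)
filter_after = {age: n for age, n in counts_all.items() if age < 30}

# Filter first, then group (models the second script).
young = [(name, age) for name, age in records if age < 30]
filter_before = dict(Counter(age for _name, age in young))

grouped_records_after = len(records)  # records that pass through the group
grouped_records_before = len(young)
```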
One exception to the rule that an intervening operator disables the combiner is
limit. Starting with Pig 0.9, even if limit comes between GROUP and FOREACH,
the combiner will still be used:
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = limit C 20;
In this example, the optimizer will push the limit above the foreach, which
will not disable the combiner.
The combiner is also not used when multiple foreach statements are associated
with the same group:
A = load 'studenttab10k' as (name, age, gpa);
B = group A by age;
C = foreach B generate group, COUNT (A);
D = foreach B generate group, MIN (A.gpa), MAX(A.gpa);
.....
Depending on your use case, it might be more efficient to split your script
into multiple scripts.
> Need to document when combiner is used
> --------------------------------------
>
> Key: PIG-1790
> URL: https://issues.apache.org/jira/browse/PIG-1790
> Project: Pig
> Issue Type: Improvement
> Components: documentation
> Reporter: Olga Natkovich
> Assignee: Corinne Chandel
> Fix For: 0.9.0
>
>
> I searched through the documentation but could not find a section that
> describes the cases under which the combiner is used. Since the combiner has
> such a significant impact on query performance, I think it is important to
> add this information. Also, with 0.9 we are expanding combiner usage, so
> having documentation would be useful for that as well.
> Here are the JIRAs for combiner use slated for 0.9:
> https://issues.apache.org/jira/browse/PIG-750
> https://issues.apache.org/jira/browse/PIG-490
> https://issues.apache.org/jira/browse/PIG-946
> https://issues.apache.org/jira/browse/PIG-1735