[ 
https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602870#action_12602870
 ] 

sms edited comment on PIG-158 at 6/5/08 6:02 PM:
-----------------------------------------------------------------

Eliminating the Generate Operator

It has been recommended earlier (Thanks Pi) that we eliminate the Generate 
operator in the Foreach ... Generate context.

In the types branch, we have a Generate operator (on both the logical and 
physical sides) that acts as a container for the projected expressions. The 
Generate operator wraps each expression inside a nested plan. The resulting 
list of plans can mix expressions that derive their input from Generate's 
predecessor with expressions that read directly from the Foreach input. The 
examples below illustrate these points.

{code}

--Example 1

a = load 'input1';
b = group a by $0;
c = foreach b {
        d = distinct a;
        generate group, sum(d.$1);
}

{code}

Logical plan after parsing:

{noformat}

ForEach Test-Plan-Builder-655
|   |
|   Generate Test-Plan-Builder-654
|   |   |
|   |   Project Test-Plan-Builder-650
|   |   |
|   |   UserFunc Test-Plan-Builder-653
|   |   |
|   |   |---Project Test-Plan-Builder-652
|   |
|   |---Distinct Test-Plan-Builder-649
|       |
|       |---Project Test-Plan-Builder-648
|
|---CoGroup Test-Plan-Builder-647
    |   |
    |   Project Test-Plan-Builder-646
    |
    |---Load Test-Plan-Builder-645

{noformat}

The Generate operator has two nested plans: one for Project(group, b) and the 
other for the aggregate (sum). There are two points to observe:

1. The projection of 'group' does not require the input 'd'. 
2. The root of the second plan, Project(1, project(d, b)), requires the input 
'd', which is connected to Generate but is not an input within the nested plan.

The first point means the projection of 'group' should be part of the Foreach 
operator itself. The second point causes a problem on the physical side: when 
getNext is called on the root of the nested plan, it looks to Generate for its 
input, whereas it needs the input from Distinct (d).

Let us look at another example. Here, input 'd' is used twice in the generate. 
This is a case of an implicit split. The output of 'd' has to be split to both 
the sum and the count.

{code}

--Example 2

a = load 'input1';
b = group a by $0;
c = foreach b {
        d = distinct a;
        generate sum(d.$1), count(d.$1);
}

{code}

In order to remove the Generate operator, the nested plans which are currently 
part of the Generate will be promoted to be a part of the Foreach operator with 
the following changes:

1. Any expression that is part of the generate (root of the nested plan) which 
does not require generate's input will be moved into a nested plan of Foreach.

2. The remaining expressions of generate will be attached as leaves of 
generate's input by duplicating the graph.

Going back to example 1, the logical plan for Foreach will have two nested 
plans. The first nested plan will contain Project(group, b). The second nested 
plan will have 'd' as the root and the aggregate function sum as the leaf.

Example 2 will translate to two nested plans both of which will have 'd' as the 
input. The leaves of the individual plans will be the aggregate functions sum 
and count respectively.
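
The two changes, and the plans they produce for both examples, can be sketched 
roughly as follows. This is an illustrative model in Python, not Pig's actual 
classes: a plan is assumed to be a linear chain of operator labels in data-flow 
order (root first, leaf last), and the names eliminate_generate, Plan, and 
GenerateExpr are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a plan is a linear chain of operator labels in data-flow order.
# The first element is the root and the last element is the leaf.
Plan = List[str]

# One generate expression: its own operator chain, plus the nested alias
# (such as 'd') that it reads from, or None if it does not need that input.
GenerateExpr = Tuple[Plan, Optional[str]]


def eliminate_generate(aliases: Dict[str, Plan],
                       generate: List[GenerateExpr]) -> List[Plan]:
    """Return the nested plans Foreach would own once Generate is removed."""
    foreach_plans: List[Plan] = []
    for expr, alias in generate:
        if alias is None:
            # Change 1: the expression does not require generate's input,
            # so it becomes a nested plan of Foreach on its own.
            foreach_plans.append(list(expr))
        else:
            # Change 2: duplicate the alias's chain and attach the expression
            # as its leaf, so each plan gets a private copy of the input.
            foreach_plans.append(list(aliases[alias]) + list(expr))
    return foreach_plans


# 'd = distinct a' from both examples, modeled as a single operator label.
aliases = {"d": ["Distinct(a)"]}

# Example 1: generate group, sum(d.$1);
example1 = eliminate_generate(aliases, [
    (["Project(group)"], None),
    (["Project($1)", "UserFunc(sum)"], "d"),
])

# Example 2: generate sum(d.$1), count(d.$1);
example2 = eliminate_generate(aliases, [
    (["Project($1)", "UserFunc(sum)"], "d"),
    (["Project($1)", "UserFunc(count)"], "d"),
])
```

In this model, example1 holds one plan with only the projection of 'group' and 
one plan rooted at 'd' with sum as the leaf. example2 holds two plans, each 
with its own copy of 'd' as the root, which is how the implicit split is 
resolved by duplication.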

> Rework logical plan
> -------------------
>
>                 Key: PIG-158
>                 URL: https://issues.apache.org/jira/browse/PIG-158
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: is_null.patch, logical_operators.patch, 
> logical_operators_rev_1.patch, logical_operators_rev_2.patch, 
> logical_operators_rev_3.patch, parser_changes.patch, parser_changes_v1.patch, 
> parser_changes_v2.patch, parser_changes_v3.patch, parser_changes_v4.patch, 
> ParserErrors.txt, udf_fix.patch, udf_funcSpec.patch, udf_return_type.patch, 
> user_func_and_store.patch, visitorWalker.patch
>
>
> Rework the logical plan in line with 
> http://wiki.apache.org/pig/PigExecutionModel

