[
https://issues.apache.org/jira/browse/PIG-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602870#action_12602870
]
Santhosh Srinivasan commented on PIG-158:
-----------------------------------------
Eliminating the Generate Operator
It has been recommended earlier (Thanks Pi) that we eliminate the Generate
operator in the Foreach ... Generate context.
In the types branch, we have a Generate operator (on both the logical and
physical side) that is a container for the projected expressions. The Generate
operator wraps each expression in its own nested plan. The resulting list of
plans can be a mixture of expressions that derive their input from Generate's
predecessor and expressions that derive it directly from the Foreach input.
Examples that illustrate these points follow.
{code}
--Example 1
a = load 'input1';
b = group a by $0;
c = foreach b {
d = distinct a;
generate group, sum(d.$1);
}
{code}
Logical plan after parsing:
ForEach Test-Plan-Builder-655
|   |
|   Generate Test-Plan-Builder-654
|   |   |
|   |   Project Test-Plan-Builder-650
|   |   |
|   |   UserFunc Test-Plan-Builder-653
|   |   |
|   |   |---Project Test-Plan-Builder-652
|   |
|   |---Distinct Test-Plan-Builder-649
|       |
|       |---Project Test-Plan-Builder-648
|
|---CoGroup Test-Plan-Builder-647
    |   |
    |   Project Test-Plan-Builder-646
    |
    |---Load Test-Plan-Builder-645
The Generate operator has 2 nested plans, one for Project(group, b) and the
other for the aggregate (sum). There are a couple of points to observe:
1. The projection of 'group' does not require the input 'd'.
2. The root of the second plan, Project(1, project(d, b)), requires the input
'd', which is connected to Generate but is not an input within the nested plan.
The former should be part of the Foreach operator; the latter is a problem
on the physical side. When the getNext call is made on the root of the nested
plan, the input from Generate is sought, whereas the input from Distinct (d) is
required.
Let us look at another example. Here, input 'd' is used twice in the generate.
This is a case of an implicit split. The output of 'd' has to be split to both
the sum and the count.
{code}
--Example 2
a = load 'input1';
b = group a by $0;
c = foreach b {
d = distinct a;
generate sum(d.$1), count(d.$1);
}
{code}
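An implicit split of this kind can be spotted mechanically: a nested alias that feeds more than one generate expression needs its output split. The following is a minimal Python sketch, not Pig code. It assumes a hypothetical representation in which each generate expression is a (function, alias) pair, and the names `implicit_splits`, `example_1` and `example_2` are made up for illustration.

```python
from collections import Counter

# Hypothetical model of the generate clauses: (function name, nested alias read).
# None marks an expression that reads the foreach input directly.
example_1 = [("project_group", None), ("sum", "d")]
example_2 = [("sum", "d"), ("count", "d")]


def implicit_splits(generate_exprs):
    """Return the nested aliases consumed by more than one generate expression."""
    uses = Counter(alias for _, alias in generate_exprs if alias is not None)
    return sorted(alias for alias, n in uses.items() if n > 1)


print(implicit_splits(example_1))  # no alias is shared in Example 1
print(implicit_splits(example_2))  # 'd' is shared by sum and count in Example 2
```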
In order to remove the Generate operator, the nested plans which are currently
part of the Generate will be promoted to be a part of the Foreach operator with
the following changes:
1. Any expression that is part of the generate (root of the nested plan) which
does not require generate's input will be moved into a nested plan of Foreach.
2. The remaining expressions of generate will be attached as leaves of
generate's input by duplicating the graph.
Going back to example 1, the logical plan for Foreach will have two nested
plans. The first nested plan will contain Project(group, b). The second nested
plan will have 'd' as the root and the aggregate function sum as the leaf.
Example 2 will translate to two nested plans both of which will have 'd' as the
input. The leaves of the individual plans will be the aggregate functions sum
and count respectively.
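The two rewrite rules and their effect on both examples can be sketched in Python. This is an illustration under simplifying assumptions, not the planned implementation: a plan is reduced to a list of operator labels ordered from source to leaf, the function name `eliminate_generate` and the labels are hypothetical, and whether an expression needs Generate's input is supplied as a flag rather than derived from the plan.

```python
def eliminate_generate(generate_input, expressions):
    """Promote Generate's nested plans to nested plans of Foreach.

    generate_input: operator chain feeding Generate, ordered source first.
    expressions: list of (operator chain, needs_generate_input) pairs.
    Returns one nested plan (a list of operator labels) per expression.
    """
    nested_plans = []
    for chain, needs_input in expressions:
        if needs_input:
            # Rule 2: duplicate the input graph and attach the expression
            # as its leaf.
            nested_plans.append(list(generate_input) + list(chain))
        else:
            # Rule 1: the expression moves into a nested plan of Foreach as is.
            nested_plans.append(list(chain))
    return nested_plans


d = ["Project(a)", "Distinct"]  # stands for the nested alias d = distinct a

example_1 = eliminate_generate(
    d, [(["Project(group)"], False), (["Project($1)", "sum"], True)]
)
example_2 = eliminate_generate(
    d, [(["Project($1)", "sum"], True), (["Project($1)", "count"], True)]
)

for plan in example_1 + example_2:
    print(" -> ".join(plan))
```

In this sketch each plan that needs 'd' receives its own copy of the Distinct chain, so the implicit split in Example 2 is resolved by duplication rather than by sharing one operator.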
> Rework logical plan
> -------------------
>
> Key: PIG-158
> URL: https://issues.apache.org/jira/browse/PIG-158
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Reporter: Alan Gates
> Assignee: Alan Gates
> Attachments: is_null.patch, logical_operators.patch,
> logical_operators_rev_1.patch, logical_operators_rev_2.patch,
> logical_operators_rev_3.patch, parser_changes.patch, parser_changes_v1.patch,
> parser_changes_v2.patch, parser_changes_v3.patch, parser_changes_v4.patch,
> ParserErrors.txt, udf_fix.patch, udf_funcSpec.patch, udf_return_type.patch,
> user_func_and_store.patch, visitorWalker.patch
>
>
> Rework the logical plan in line with
> http://wiki.apache.org/pig/PigExecutionModel