[
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887923#action_12887923
]
Yan Zhou commented on PIG-1399:
-------------------------------
This work is not to optimize on the generic boolean logics, but rather to
simplify the logic expression based upon the constant values as the logical
expression's operands. The former, e.g., would change an boolean expression of
((A AND B) OR (A AND C)) to (A AND (B OR C)); while the latter will change,
say, (a0 > 5 and a0 > 7) to (a0 > 7). It is, therefore, up to the query
writer to optimize his/her boolean logic, probably through use of some other
tools.
The algorithm works in a series of steps in order:
1) a constant expression evaluation visitor that will evaluate the constant
expressions. It works by traversing the expression tree in a bottom-up manner
and evaluate all subexpressions that have all constant subexpressions. All
results from constant children are pushed to a stack for the parent to digest
for its own evaluation. Any non-constant expression will push a null to the
stack and consequently will cause all of its ancestors not to be evaluated.
For simplicity, only constant binary expressions and constant unary expressions
and evaluated. More complex expressions are not evaluated at this moment. For
UDF, this evaluation is not planned at all for fear of possible stateful
consequences resulting from the original evaluations;
2) A NOT conversion visitor that will traverse the expression tree in a
depth-first manner with post-order handling. A status of negativity for a NOT
expression is recorded in the depth-first traversal before subtree traversal
and reversed after traversing the subtree. All "reversible" expressions is
replaced by its negated counterpart for negative negativity. Currently equality
ops, and its non-equality couter part, all range comparisons, logical AND and
OR are reversible.
Notably missing is the "is null" for lack of a "is not null" base expression;
3) A DNF plan is generated, through a helper DNFPlanGenerator visitor class,
whose disjunctions are either of OrExpression or a new DNFExpression with type
of "OR", whose conjunctions are either of AndExpression or the new
DNFExpression with type of "AND".
The introduction of the new DNFExpression, which extends LogiclExpression,
is to support multiple children (vs. the two children in a BinaryExpression) to
facilitate the processing of multiple children
of an "OR" or "AND" operator due to the commutative property of the two
operators. The leaves of the DNF are of a new LogiclExpressionProxy type that
extends the LogicalExpression.
This new type is to be used as a proxy toward the original leaf expression
in the original filter plan. The purpose is to track how often an original
expression has been put in
the DNF plan as result of the normalizing process. Consequently, a
DNFSplitCounter member is added to the LogicalExpression, which is incremented
once a new proxy is created
on the original expression. Due to the potentially exponential growth of the
DNF plan, and the nonlinear complexity to trim the DNF plan (see 4 below), the
size of the DNF plan is limited to 100 nodes beyind which the simplification
beyond step 2) are just skipped;
4) Then the DNF plan is trimmed according to the inferrence rules between the
operands of the conjunctions first, and then between the operands of the
disjuction in the DNF plan.
If a leaf is trimmed, the counter, DNFSpliCounter, of the source of the
proxy will be decremented. Basically, the DNF plan is used as a utility to
determine if an original leaf
expression can be trimmed from the original filter plan or not. If all
proxies of the original leaf expression have been trimmed from the DNF plan,
the original leaf expression can be trimmed from the original plan then.
The point is that the DNF plan is not intended to replace the original filer
plan since the DNF plan in general tends to be more expensive to evaluate than
the original filter plan.
5) The original filter plan is traversed in a bottom-up manner so that if a
leaf's DNFSpliCounter is zero, which means all of its proxies on DNF has been
trimmed, the leaf will be trimmed.
For nonleafs of "AND" or "OR" expressions, if one child survives, the child
will be relinked to the predecessor(s). If either or both children are trimmed,
the nonleaf will be trimmed
too. If the whole new filter plan is empty, the filter operator will be
removed from the logical plan too.
Using a example of "B = filter A by NOT((a0 > 1) or (a1 < 3 and a0 >3+2))":
After 1), the filter plan becomes "NOT((a0 > 1) or (a1 < 3 and a0 >5))";
After 2), the filter plan becomes "(a0 <= 1) AND ((a1 >= 3) OR (a0 <= 5))";
After 3), the DNF plan is "((a0 <= 1) AND (a1 >= 3)) OR ((a0 <= 1) AND (a0 <=
5))";
After 4), the DNF plan becomes "a0 <= 1";
After 5), the filter plan becomes "a0 <= 1".
> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>
> Key: PIG-1399
> URL: https://issues.apache.org/jira/browse/PIG-1399
> Project: Pig
> Issue Type: Sub-task
> Components: impl
> Affects Versions: 0.7.0
> Reporter: Daniel Dai
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.