[ 
https://issues.apache.org/jira/browse/IMPALA-7754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679132#comment-16679132
 ] 

Paul Rogers edited comment on IMPALA-7754 at 11/8/18 1:54 AM:
--------------------------------------------------------------

There be dragons here. It seems that, without rewrites, we get slightly wrong 
results in certain pathological queries. Consider this case in 
{{test-empty.sql}}:

{noformat}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
left outer join functional.alltypestiny b
on (a.id = b.id and 1 + 1 > 10)
---- PLAN
Max Per-Host Resource Reservation: Memory=1.95MB Threads=3
Per-Host Resource Estimates: Memory=66MB
PLAN-ROOT SINK
|
02:HASH JOIN [LEFT OUTER JOIN]
|  hash predicates: a.id = b.id
|  other join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

If we re-analyze expressions after rewrite, we get the following, which seems 
more accurate:

{noformat}
Max Per-Host Resource Reservation: Memory=16.00KB Threads=3
Per-Host Resource Estimates: Memory=64MB
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [LEFT OUTER JOIN]
|  join predicates: (FALSE)
{noformat}

Note that, in the current expected results (the top example), the predicate 
{{a.id = b.id}} is pushed down, ignoring the fact that it is part of a 
predicate which reduces to {{FALSE}}. In the second version, the planner has 
correctly identified that the join produces no rows. Also, the memory cost is 
lower for the second example, Hence, the current behavior (first example) 
appears wrong.

The reason for this difference boils down to the lack of re-analysis in the 
expression rewriter. Lets' take it step-by-step.

* Pass {{a.id = b.id and 1 + 1 > 10}} to the rewriter.
* Nothing can be done with {{a.id = b.id}}.
* Constant folding: {{1 + 1 > 10}} becomes {{FALSE}}.
* Normalize binary predicates: {{a.id = b.id AND FALSE}} becomes {{FALSE AND 
a.id = b.id}}.
* The result of the above is not analyzed (it is a new node), so the call to 
simplify the above is skipped.

When re-analysis is added, the last step above changes to:

* Simplify conditional: {{FALSE AND a.id = b.id}} becomes {{FALSE}}.

Thus, the second plan shown above is the correct one, the first one is the 
result of an incomplete expression rewrite.

We can verify this by creating a "baseline" query and running it through 
{{PlannerTest}}: replace the ON clause with a simple {{FALSE}}:

{noformat}
select *
from functional.alltypessmall a
left outer join functional.alltypestiny b
on FALSE
---- PLAN
Max Per-Host Resource Reservation: Memory=16.00KB Threads=3
Per-Host Resource Estimates: Memory=64MB
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [LEFT OUTER JOIN]
|  join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

The rewritten form of the query shown at the top of this node should produce an 
identical plan.


was (Author: paul.rogers):
There be dragons here. It seems that, without rewrites, we get slightly wrong 
results in certain pathological queries. Consider this case in 
{{test-empty.sql}}:

{noformat}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
left outer join functional.alltypestiny b
on (a.id = b.id and 1 + 1 > 10)
---- PLAN
Max Per-Host Resource Reservation: Memory=1.95MB Threads=3
Per-Host Resource Estimates: Memory=66MB
PLAN-ROOT SINK
|
02:HASH JOIN [LEFT OUTER JOIN]
|  hash predicates: a.id = b.id
|  other join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

If we re-analyze expressions after rewrite, we get the following, which seems 
more accurate:

{noformat}
Max Per-Host Resource Reservation: Memory=16.00KB Threads=3
Per-Host Resource Estimates: Memory=64MB
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [LEFT OUTER JOIN]
|  join predicates: (FALSE)
{noformat}

Note that, in the current expected results (the top example), the predicate 
{{a.id = b.id}} is pushed down, ignoring the fact that it is part of a 
predicate which reduces to {{FALSE}}. In the second version, the planner has 
correctly identified that the join produces no rows. Also, the memory cost is 
lower for the second example, Hence, the current behavior (first example) 
appears wrong.


> Expressions sometimes not re-analyzed after rewrite
> ---------------------------------------------------
>
>                 Key: IMPALA-7754
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7754
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> The analyzer has a chain of rules which fire in order without (as noted 
> above) repeats. The result of rule A (rewriting conditional functions) is fed 
> into rule B (simplify CASE). Each rule requires that analysis be done so that 
> attributes of expressions can be picked out.
> As it turns out, in the current code, this is rather ad-hoc. The 
> {{SimplifyConditionalsRule}} re-analyzes its result as part of the fix for 
> IMPALA-5125, but others do not, leading to optimizations not working. In 
> particular, in a chain of rewrites for {{IS DISTINCT FROM}}, certain rules 
> didn't fire because previous rules left new expressions in an un-analyzed 
> state. This is a bug.
> The fix is to analyze the result any time a rule fires, before passing the 
> result to the next rule.
> {code:java}
>   private Expr applyRuleBottomUp(Expr expr, ExprRewriteRule rule, Analyzer 
> analyzer)
>       throws AnalysisException {
>     ...
>     Expr rewrittenExpr = rule.apply(expr, analyzer);
>     if (rewrittenExpr != expr) {
>       ++numChanges_;
>       rewrittenExpr.analyze(analyzer); // Add me!
>     }}
>     return rewrittenExpr;
>   }
> {code}
> There are several places that the above logic appears: make the change in all 
> of them.
> Then, in rules that simply refused to run if an expression is to analyzed:
> {code:java}
> public class SimplifyDistinctFromRule implements ExprRewriteRule {
>   public Expr apply(Expr expr, Analyzer analyzer) {
>     if (!expr.isAnalyzed()) return expr;
> {code}
> Replace this with an assertion that analysis must have been done:
> {code:java}
> public class SimplifyDistinctFromRule implements ExprRewriteRule {
>   public Expr apply(Expr expr, Analyzer analyzer) {
>     assert expr.isAnalyzed();
> {code}
> To be safe, the assertion fires only in "debug" mode, not in production.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to