[ 
https://issues.apache.org/jira/browse/IMPALA-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-7997:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Rewrites omitted for ON clause, test case expect wrong plans
> ------------------------------------------------------------
>
>                 Key: IMPALA-7997
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7997
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Bugs in the rewrite code prevented the {{ON}} clause from being subject to 
> the full set of rewrite rules. In particular, it appears that the “simplify 
> conditions” rule was not applied. An effort to fix rewrites modified the code 
> to correctly apply rewrites for {{ON}}. This exposed a bug in the planner for 
> certain unrealistic, pathological queries.
> {{PlannerTest.empy.test}} has the following test case:
> {code:sql}
> # Constant conjunct in the ON-clause of an outer join is
> # assigned to the join.
> select *
> from functional.alltypessmall a
> right outer join functional.alltypestiny b
> on (a.id = b.id and !true)
> {code}
> Note the {{ON}} clause meaning, which is realized after rewrites:
> {noformat}
> a.id = b.id and !true --> a.id = b.id AND FALSE --> FALSE
> {noformat}
> That is, no rows in the join match. Currently expected (incorrect) plan:
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [RIGHT OUTER JOIN]
> |  hash predicates: a.id = b.id
> |  other join predicates: FALSE
> |  runtime filters: RF000 <- b.id
> |
> |--01:SCAN HDFS [functional.alltypestiny b]
> |     partitions=4/4 files=4 size=460B
> |
> 00:SCAN HDFS [functional.alltypessmall a]
>    partitions=4/4 files=4 size=6.32KB
>    runtime filters: RF000 -> a.id
> {noformat}
> Note the attempt to push the runtime filter for {{b.id}} into the left HDFS 
> scan, then the query throws away the rows because of the {{FALSE}} join 
> condition.
> Once rewrites are applied, we get a new plan, which is actually worse (see 
> IMPALA-7996):
> {noformat}
> PLAN-ROOT SINK
> |
> 02:NESTED LOOP JOIN [RIGHT OUTER JOIN]
> |  join predicates: FALSE
> |
> |--01:SCAN HDFS [functional.alltypestiny b]
> |     partitions=4/4 files=4 size=460B
> |
> 00:SCAN HDFS [functional.alltypessmall a]
>    partitions=4/4 files=4 size=6.32KB
> {noformat}
> Similar case:
> {code:sql}
> # Constant conjunct in the ON-clause of an outer join is
> # assigned to the join.
> select *
> from functional.alltypessmall a
> left outer join functional.alltypestiny b
> on (a.id = b.id and 1 + 1 > 10)
> {code}
> Reasoning:
> {noformat}
> a.id = b.id and 1 + 1 > 10 --> a.id = b.id AND 2 > 10
>   --> a.id = b.id AND FALSE --> FALSE
> {noformat}
> The current expected plan:
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [LEFT OUTER JOIN]
> |  hash predicates: a.id = b.id
> |  other join predicates: FALSE
> |
> |--01:SCAN HDFS [functional.alltypestiny b]
> |     partitions=4/4 files=4 size=460B
> |
> 00:SCAN HDFS [functional.alltypessmall a]
>    partitions=4/4 files=4 size=6.32KB
> {noformat}
> The revised plan, after rewrites folding:
> {noformat}
> PLAN-ROOT SINK
> |
> 02:NESTED LOOP JOIN [LEFT OUTER JOIN]
> |  join predicates: FALSE
> |
> |--01:SCAN HDFS [functional.alltypestiny b]
> |     partitions=4/4 files=4 size=460B
> |
> 00:SCAN HDFS [functional.alltypessmall a]
>    partitions=4/4 files=4 size=6.32KB
> {noformat}
> The lack of simplification also shows up in {{resource-requirements.test}}
> Previous:
> {noformat}
> Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
> o_totalprice,
> sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
> (SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
> sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey 
> =
> `$a$1`.`$c$1` WHERE TRUE AND c_custkey = o_custkey AND o_orderkey = l_orderkey
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY
> o_totalprice DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
> {noformat}
> Revised:
> {noformat}
> Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
> o_totalprice,
> sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
> (SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
> sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey 
> =
> `$a$1`.`$c$1` WHERE c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY
> c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice
> DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
> {noformat}
> Notice the elimination of {{TRUE AND}} in the revised {{WHERE}} clause.
> Also in {{subquery-rewrite.test}}, previous:
> {noformat}
> |     predicates: FALSE OR functional.alltypestiny.int_col IS NULL OR 
> functional.alltypestiny.int_col = 1
> {noformat}
> Revised:
> {noformat}
> |     predicates: functional.alltypestiny.int_col IS NULL OR 
> functional.alltypestiny.int_col = 1
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to