[ https://issues.apache.org/jira/browse/IMPALA-7997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned IMPALA-7997: ----------------------------------- Assignee: (was: Paul Rogers) > Rewrites omitted for ON clause, test case expect wrong plans > ------------------------------------------------------------ > > Key: IMPALA-7997 > URL: https://issues.apache.org/jira/browse/IMPALA-7997 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Priority: Minor > > Bugs in the rewrite code prevented the {{ON}} clause from being subject to > the full set of rewrite rules. In particular, it appears that the “simplify > conditions” rule was not applied. An effort to fix rewrites modified the code > to correctly apply rewrites for {{ON}}. This exposed a bug in the planner for > certain unrealistic, pathological queries. > {{PlannerTest.empy.test}} has the following test case: > {code:sql} > # Constant conjunct in the ON-clause of an outer join is > # assigned to the join. > select * > from functional.alltypessmall a > right outer join functional.alltypestiny b > on (a.id = b.id and !true) > {code} > Note the {{ON}} clause meaning, which is realized after rewrites: > {noformat} > a.id = b.id and !true --> a.id = b.id AND FALSE --> FALSE > {noformat} > That is, no rows in the join match. Currently expected (incorrect) plan: > {noformat} > PLAN-ROOT SINK > | > 02:HASH JOIN [RIGHT OUTER JOIN] > | hash predicates: a.id = b.id > | other join predicates: FALSE > | runtime filters: RF000 <- b.id > | > |--01:SCAN HDFS [functional.alltypestiny b] > | partitions=4/4 files=4 size=460B > | > 00:SCAN HDFS [functional.alltypessmall a] > partitions=4/4 files=4 size=6.32KB > runtime filters: RF000 -> a.id > {noformat} > Note the attempt to push the runtime filter for {{b.id}} into the left HDFS > scan, then the query throws away the rows because of the {{FALSE}} join > condition. > Once rewrites are applied, we get a new plan, which is actually worse (see > IMPALA-7996): > {noformat} > PLAN-ROOT SINK > | > 02:NESTED LOOP JOIN [RIGHT OUTER JOIN] > | join predicates: FALSE > | > |--01:SCAN HDFS [functional.alltypestiny b] > | partitions=4/4 files=4 size=460B > | > 00:SCAN HDFS [functional.alltypessmall a] > partitions=4/4 files=4 size=6.32KB > {noformat} > Similar case: > {code:sql} > # Constant conjunct in the ON-clause of an outer join is > # assigned to the join. > select * > from functional.alltypessmall a > left outer join functional.alltypestiny b > on (a.id = b.id and 1 + 1 > 10) > {code} > Reasoning: > {noformat} > a.id = b.id and 1 + 1 > 10 --> a.id = b.id AND 2 > 10 > --> a.id = b.id AND FALSE --> FALSE > {noformat} > The current expected plan: > {noformat} > PLAN-ROOT SINK > | > 02:HASH JOIN [LEFT OUTER JOIN] > | hash predicates: a.id = b.id > | other join predicates: FALSE > | > |--01:SCAN HDFS [functional.alltypestiny b] > | partitions=4/4 files=4 size=460B > | > 00:SCAN HDFS [functional.alltypessmall a] > partitions=4/4 files=4 size=6.32KB > {noformat} > The revised plan, after rewrites folding: > {noformat} > PLAN-ROOT SINK > | > 02:NESTED LOOP JOIN [LEFT OUTER JOIN] > | join predicates: FALSE > | > |--01:SCAN HDFS [functional.alltypestiny b] > | partitions=4/4 files=4 size=460B > | > 00:SCAN HDFS [functional.alltypessmall a] > partitions=4/4 files=4 size=6.32KB > {noformat} > The lack of simplification also shows up in {{resource-requirements.test}} > Previous: > {noformat} > Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, > o_totalprice, > sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN > (SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /* > sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey > = > `$a$1`.`$c$1` WHERE TRUE AND c_custkey = o_custkey AND o_orderkey = l_orderkey > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY > o_totalprice DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT) > {noformat} > Revised: > {noformat} > Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, > o_totalprice, > sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN > (SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /* > sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey > = > `$a$1`.`$c$1` WHERE c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY > c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice > DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT) > {noformat} > Notice the elimination of {{TRUE AND}} in the revised {{WHERE}} clause. > Also in {{subquery-rewrite.test}}, previous: > {noformat} > | predicates: FALSE OR functional.alltypestiny.int_col IS NULL OR > functional.alltypestiny.int_col = 1 > {noformat} > Revised: > {noformat} > | predicates: functional.alltypestiny.int_col IS NULL OR > functional.alltypestiny.int_col = 1 > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org