[ https://issues.apache.org/jira/browse/HIVE-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Solimando updated HIVE-26722: ---------------------------------------- Description: h1. Reproducer Consider the following query: {code:java} set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); CREATE EXTERNAL TABLE t (a string, b string); INSERT INTO t VALUES ('1000', 'b1'); INSERT INTO t VALUES ('2000', 'b2'); SELECT * FROM ( SELECT a, b FROM t UNION ALL SELECT a, CAST(NULL AS string) FROM t) AS t2 WHERE a = 1000;EXPLAIN CBO SELECT * FROM ( SELECT a, b FROM t UNION ALL SELECT a, CAST(NULL AS string) FROM t) AS t2 WHERE a = 1000; {code} The expected result is: {code:java} 1000 b1 1000 NULL{code} An example of correct plan is as follows: {noformat} CBO PLAN: HiveUnion(all=[true]) HiveProject(a=[$0], b=[$1]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]) HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]){noformat} Consider now a scenario where expression reduction in projections is disabled by setting the following property{_}:{_} {noformat} set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); {noformat} In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, and we get the following (invalid) result: {code:java} 1000 b1{code} produced by the following invalid plan: {code:java} CBO PLAN: HiveProject(a=[$0], b=[$1]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]) {code} h1. Problem Analysis At [HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112] the _RelMetadataQuery::getPulledUpPredicates_ method infers the following predicate due to the CAST(NULL) in the projection: {code:java} (=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code} When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is inferred. In [HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122], the rule checks if the conjunction of the predicate coming from the filter (here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or not, under the _UnknownAsFalse_ semantics. To summarize, the following expression is simplified under the _UnknownAsFalse_ semantics: {code:java} AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), =(CAST($0):DOUBLE, 1000)) {code} Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to {_}FALSE{_}, because no value is equal to NULL (even NULL itself), AND(FALSE, =(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and the UNION ALL operand is pruned. Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the issue, due to the _IS_NULL($1)_ inferred predicate, see [HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156] for understanding how the NULL literal is treated differently during predicate inference. _HiveRelMdPredicates_ should not use equality ('=') for nullable constant expressions, but rather IS NOT DISTINCT FROM, as detailed in HIVE-26733, but nonetheless the way simplification is done is not correct here, inferred predicates should be used as "context", rather than been used in a conjunctive expression, this usage does not conform with any of the similar uses of simplification with inferred predicates (see the bottom of the "Solution" section for examples and details). h1. Solution In order to correctly simplify a predicate and test if it's always false or not, we should build RexSimplify with _predicates_ as the list of predicates known to hold in the context. In this way, the different semantics are correctly taken into account. The code at [HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121] should be replaced by the following: {code:java} final RexExecutor executor = Util.first(filterRel.getCluster().getPlanner().getExecutor(), RexUtil.EXECUTOR); final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, executor); final RexNode x = simplify.simplifyUnknownAs(newCondition, RexUnknownAs.FALSE);{code} This is in line with other uses of simplification, like in Calcite: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705] You can see the above method used for different kind of rels, always in the way identical to what this ticket advocates:filters: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155] projects: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310] joins: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383] windows: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576] was: h1. Reproducer Consider the following query: {code:java} set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); CREATE EXTERNAL TABLE t (a string, b string); INSERT INTO t VALUES ('1000', 'b1'); INSERT INTO t VALUES ('2000', 'b2'); SELECT * FROM ( SELECT a, b FROM t UNION ALL SELECT a, CAST(NULL AS string) FROM t) AS t2 WHERE a = 1000;EXPLAIN CBO SELECT * FROM ( SELECT a, b FROM t UNION ALL SELECT a, CAST(NULL AS string) FROM t) AS t2 WHERE a = 1000; {code} The expected result is: {code:java} 1000 b1 1000 NULL{code} An example of correct plan is as follows: {noformat} CBO PLAN: HiveUnion(all=[true]) HiveProject(a=[$0], b=[$1]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]) HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET "UTF-16LE"]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]){noformat} Consider now a scenario where expression reduction in projections is disabled by setting the following property{_}:{_} {noformat} set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); {noformat} In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, and we get the following (invalid) result: {code:java} 1000 b1{code} produced by the following invalid plan: {code:java} CBO PLAN: HiveProject(a=[$0], b=[$1]) HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) HiveTableScan(table=[[default, t]], table:alias=[t]) {code} h1. Problem Analysis At [HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112] the _RelMetadataQuery::getPulledUpPredicates_ method infers the following predicate due to the CAST(NULL) in the projection: {code:java} (=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code} When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is inferred. In [HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122], the rule checks if the conjunction of the predicate coming from the filter (here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or not, under the _UnknownAsFalse_ semantics. To summarize, the following expression is simplified under the _UnknownAsFalse_ semantics: {code:java} AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), =(CAST($0):DOUBLE, 1000)) {code} Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to {_}FALSE{_}, because no value is equal to NULL (even NULL itself), AND(FALSE, =(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and the UNION ALL operand is pruned. Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the issue, due to the _IS_NULL($1)_ inferred predicate, see [HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156] for understanding how the NULL literal is treated differently during predicate inference. The problem lies in the fact that, depending on the input _RelNode_ that we infer predicates from, the semantics is not necessarily {_}UnknownAsFalse{_}, but it might be {_}UnknownAsUnknown{_}, like for {_}Project{_}, as in this case. h1. Solution In order to correctly simplify a predicate and test if it's always false or not, we should build RexSimplify with _predicates_ as the list of predicates known to hold in the context. In this way, the different semantics are correctly taken into account. The code at [HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121] should be replaced by the following: {code:java} final RexExecutor executor = Util.first(filterRel.getCluster().getPlanner().getExecutor(), RexUtil.EXECUTOR); final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, executor); final RexNode x = simplify.simplifyUnknownAs(newCondition, RexUnknownAs.FALSE);{code} This is in line with other uses of simplification, like in Calcite: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705] You can see the above method used for different kind of rels, always in the way identical to what this ticket advocates:filters: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155] projects: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310] joins: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383] windows: [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576] > HiveFilterSetOpTransposeRule incorrectly prunes UNION ALL operands > ------------------------------------------------------------------ > > Key: HIVE-26722 > URL: https://issues.apache.org/jira/browse/HIVE-26722 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 4.0.0-alpha-1 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > h1. Reproducer > Consider the following query: > {code:java} > set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); > CREATE EXTERNAL TABLE t (a string, b string); > INSERT INTO t VALUES ('1000', 'b1'); > INSERT INTO t VALUES ('2000', 'b2'); > SELECT * FROM ( > SELECT > a, > b > FROM t > UNION ALL > SELECT > a, > CAST(NULL AS string) > FROM t) AS t2 > WHERE a = 1000;EXPLAIN CBO > SELECT * FROM ( > SELECT > a, > b > FROM t > UNION ALL > SELECT > a, > CAST(NULL AS string) > FROM t) AS t2 > WHERE a = 1000; {code} > The expected result is: > {code:java} > 1000 b1 > 1000 NULL{code} > An example of correct plan is as follows: > {noformat} > CBO PLAN: > HiveUnion(all=[true]) > HiveProject(a=[$0], b=[$1]) > HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) > HiveTableScan(table=[[default, t]], table:alias=[t]) > HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET > "UTF-16LE"]) > HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) > HiveTableScan(table=[[default, t]], table:alias=[t]){noformat} > > Consider now a scenario where expression reduction in projections is disabled > by setting the following property{_}:{_} > {noformat} > set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\); > {noformat} > In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, > and we get the following (invalid) result: > {code:java} > 1000 b1{code} > produced by the following invalid plan: > {code:java} > CBO PLAN: > HiveProject(a=[$0], b=[$1]) > HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)]) > HiveTableScan(table=[[default, t]], table:alias=[t]) {code} > h1. Problem Analysis > At > [HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112] > the _RelMetadataQuery::getPulledUpPredicates_ method infers the following > predicate due to the CAST(NULL) in the projection: > {code:java} > (=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code} > When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is > inferred. > In > [HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122], > the rule checks if the conjunction of the predicate coming from the filter > (here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or > not, under the _UnknownAsFalse_ semantics. > To summarize, the following expression is simplified under the > _UnknownAsFalse_ semantics: > {code:java} > AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), > =(CAST($0):DOUBLE, 1000)) > {code} > Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to > {_}FALSE{_}, because no value is equal to NULL (even NULL itself), AND(FALSE, > =(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and > the UNION ALL operand is pruned. > Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the > issue, due to the _IS_NULL($1)_ inferred predicate, see > [HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156] > for understanding how the NULL literal is treated differently during > predicate inference. > _HiveRelMdPredicates_ should not use equality ('=') for nullable constant > expressions, but rather IS NOT DISTINCT FROM, as detailed in HIVE-26733, but > nonetheless the way simplification is done is not correct here, inferred > predicates should be used as "context", rather than been used in a > conjunctive expression, this usage does not conform with any of the similar > uses of simplification with inferred predicates (see the bottom of the > "Solution" section for examples and details). > h1. Solution > In order to correctly simplify a predicate and test if it's always false or > not, we should build RexSimplify with _predicates_ as the list of predicates > known to hold in the context. In this way, the different semantics are > correctly taken into account. > The code at > [HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121] > should be replaced by the following: > {code:java} > final RexExecutor executor = > Util.first(filterRel.getCluster().getPlanner().getExecutor(), > RexUtil.EXECUTOR); > final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, > executor); > final RexNode x = simplify.simplifyUnknownAs(newCondition, > RexUnknownAs.FALSE);{code} > This is in line with other uses of simplification, like in Calcite: > [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705] > You can see the above method used for different kind of rels, always in the > way identical to what this ticket advocates:filters: > [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155] > projects: > [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310] > joins: > [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383] > windows: > [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576] -- This message was sent by Atlassian Jira (v8.20.10#820010)