[ 
https://issues.apache.org/jira/browse/HIVE-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Solimando updated HIVE-26722:
----------------------------------------
    Description: 
h1. Reproducer

Consider the following query:
{code:java}
set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);

CREATE EXTERNAL TABLE t (a string, b string);
INSERT INTO t VALUES ('1000', 'b1');
INSERT INTO t VALUES ('2000', 'b2');

SELECT * FROM (
  SELECT
   a,
   b
  FROM t
   UNION ALL
  SELECT
   a,
   CAST(NULL AS string)
   FROM t) AS t2
WHERE a = 1000;EXPLAIN CBO
SELECT * FROM (
  SELECT
   a,
   b
  FROM t
   UNION ALL
  SELECT
   a,
   CAST(NULL AS string)
   FROM t) AS t2
WHERE a = 1000; {code}
The expected result is:
{code:java}
1000    b1
1000    NULL{code}
An example of correct plan is as follows:
{noformat}
CBO PLAN:
HiveUnion(all=[true])
  HiveProject(a=[$0], b=[$1])
    HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
      HiveTableScan(table=[[default, t]], table:alias=[t])
  HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"])
    HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
      HiveTableScan(table=[[default, t]], table:alias=[t]){noformat}
 

Consider now a scenario where expression reduction in projections is disabled 
by setting the following property{_}:{_}
{noformat}
set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
{noformat}
In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, 
and we get the following (invalid) result:
{code:java}
1000    b1{code}
produced by the following invalid plan:
{code:java}
CBO PLAN:
HiveProject(a=[$0], b=[$1])
  HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
    HiveTableScan(table=[[default, t]], table:alias=[t]) {code}
h1. Problem Analysis

At 
[HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112]
 the _RelMetadataQuery::getPulledUpPredicates_ method infers the following 
predicate due to the CAST(NULL) in the projection:
{code:java}
(=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code}
When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is 
inferred.

In 
[HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122],
 the rule checks if the conjunction of the predicate coming from the filter 
(here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or 
not, under the _UnknownAsFalse_ semantics.

To summarize, the following expression is simplified under the _UnknownAsFalse_ 
semantics:
{code:java}
AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), 
=(CAST($0):DOUBLE, 1000))
{code}
Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to {_}FALSE{_}, 
because no value is equal to NULL (even NULL itself), AND(FALSE,  
=(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and the 
UNION ALL operand is pruned.

Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the issue, 
due to the _IS_NULL($1)_ inferred predicate, see 
[HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156]
 for understanding how the NULL literal is treated differently during predicate 
inference.

_HiveRelMdPredicates_ should not use equality ('=') for nullable constant 
expressions, but rather IS NOT DISTINCT FROM, as detailed in HIVE-26733, but 
nonetheless the way simplification is done is not correct here, inferred 
predicates should be used as "context", rather than been used in a conjunctive 
expression, this usage does not conform with any of the similar uses of 
simplification with inferred predicates (see the bottom of the "Solution" 
section for examples and details).
h1. Solution

In order to correctly simplify a predicate and test if it's always false or 
not, we should build RexSimplify with _predicates_ as the list of predicates 
known to hold in the context. In this way, the different semantics are 
correctly taken into account.

The code at 
[HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121]
 should be replaced by the following:
{code:java}
final RexExecutor executor =
Util.first(filterRel.getCluster().getPlanner().getExecutor(), RexUtil.EXECUTOR);
final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, executor);
final RexNode x = simplify.simplifyUnknownAs(newCondition, 
RexUnknownAs.FALSE);{code}
This is in line with other uses of simplification, like in Calcite:

[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705]

You can see the above method used for different kind of rels, always in the way 
identical to what this ticket advocates:filters:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155]

projects:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310]

joins:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383]

windows:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576]

  was:
h1. Reproducer

Consider the following query:
{code:java}
set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);

CREATE EXTERNAL TABLE t (a string, b string);
INSERT INTO t VALUES ('1000', 'b1');
INSERT INTO t VALUES ('2000', 'b2');

SELECT * FROM (
  SELECT
   a,
   b
  FROM t
   UNION ALL
  SELECT
   a,
   CAST(NULL AS string)
   FROM t) AS t2
WHERE a = 1000;EXPLAIN CBO
SELECT * FROM (
  SELECT
   a,
   b
  FROM t
   UNION ALL
  SELECT
   a,
   CAST(NULL AS string)
   FROM t) AS t2
WHERE a = 1000; {code}
 

The expected result is:
{code:java}
1000    b1
1000    NULL{code}
An example of correct plan is as follows:
{noformat}
CBO PLAN:
HiveUnion(all=[true])
  HiveProject(a=[$0], b=[$1])
    HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
      HiveTableScan(table=[[default, t]], table:alias=[t])
  HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE"])
    HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
      HiveTableScan(table=[[default, t]], table:alias=[t]){noformat}
 

Consider now a scenario where expression reduction in projections is disabled 
by setting the following property{_}:{_}
{noformat}
set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
{noformat}
In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, 
and we get the following (invalid) result:
{code:java}
1000    b1{code}
produced by the following invalid plan:
{code:java}
CBO PLAN:
HiveProject(a=[$0], b=[$1])
  HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
    HiveTableScan(table=[[default, t]], table:alias=[t]) {code}
h1. Problem Analysis

At 
[HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112]
 the _RelMetadataQuery::getPulledUpPredicates_ method infers the following 
predicate due to the CAST(NULL) in the projection:
{code:java}
(=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code}
When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is 
inferred.

In 
[HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122],
 the rule checks if the conjunction of the predicate coming from the filter 
(here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or 
not, under the _UnknownAsFalse_ semantics.

To summarize, the following expression is simplified under the _UnknownAsFalse_ 
semantics:
{code:java}
AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), 
=(CAST($0):DOUBLE, 1000))
{code}
Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to {_}FALSE{_}, 
because no value is equal to NULL (even NULL itself), AND(FALSE,  
=(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and the 
UNION ALL operand is pruned.

Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the issue, 
due to the _IS_NULL($1)_ inferred predicate, see 
[HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156]
 for understanding how the NULL literal is treated differently during predicate 
inference.

The problem lies in the fact that, depending on the input _RelNode_ that we 
infer predicates from, the semantics is not necessarily {_}UnknownAsFalse{_}, 
but it might be {_}UnknownAsUnknown{_}, like for {_}Project{_}, as in this case.
h1. Solution

In order to correctly simplify a predicate and test if it's always false or 
not, we should build RexSimplify with _predicates_ as the list of predicates 
known to hold in the context. In this way, the different semantics are 
correctly taken into account.

The code at 
[HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121]
 should be replaced by the following:
{code:java}
final RexExecutor executor =
Util.first(filterRel.getCluster().getPlanner().getExecutor(), RexUtil.EXECUTOR);
final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, executor);
final RexNode x = simplify.simplifyUnknownAs(newCondition, 
RexUnknownAs.FALSE);{code}
This is in line with other uses of simplification, like in Calcite:

[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705]

You can see the above method used for different kind of rels, always in the way 
identical to what this ticket advocates:filters:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155]

projects:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310]

joins:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383]

windows:
[https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576]


> HiveFilterSetOpTransposeRule incorrectly prunes UNION ALL operands
> ------------------------------------------------------------------
>
>                 Key: HIVE-26722
>                 URL: https://issues.apache.org/jira/browse/HIVE-26722
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> h1. Reproducer
> Consider the following query:
> {code:java}
> set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
> CREATE EXTERNAL TABLE t (a string, b string);
> INSERT INTO t VALUES ('1000', 'b1');
> INSERT INTO t VALUES ('2000', 'b2');
> SELECT * FROM (
>   SELECT
>    a,
>    b
>   FROM t
>    UNION ALL
>   SELECT
>    a,
>    CAST(NULL AS string)
>    FROM t) AS t2
> WHERE a = 1000;EXPLAIN CBO
> SELECT * FROM (
>   SELECT
>    a,
>    b
>   FROM t
>    UNION ALL
>   SELECT
>    a,
>    CAST(NULL AS string)
>    FROM t) AS t2
> WHERE a = 1000; {code}
> The expected result is:
> {code:java}
> 1000    b1
> 1000    NULL{code}
> An example of correct plan is as follows:
> {noformat}
> CBO PLAN:
> HiveUnion(all=[true])
>   HiveProject(a=[$0], b=[$1])
>     HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
>       HiveTableScan(table=[[default, t]], table:alias=[t])
>   HiveProject(a=[$0], _o__c1=[null:VARCHAR(2147483647) CHARACTER SET 
> "UTF-16LE"])
>     HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
>       HiveTableScan(table=[[default, t]], table:alias=[t]){noformat}
>  
> Consider now a scenario where expression reduction in projections is disabled 
> by setting the following property{_}:{_}
> {noformat}
> set hive.cbo.rule.exclusion.regex=ReduceExpressionsRule\(Project\);
> {noformat}
> In this case, the simplification of _CAST(NULL)_ into _NULL_ does not happen, 
> and we get the following (invalid) result:
> {code:java}
> 1000    b1{code}
> produced by the following invalid plan:
> {code:java}
> CBO PLAN:
> HiveProject(a=[$0], b=[$1])
>   HiveFilter(condition=[=(CAST($0):DOUBLE, 1000)])
>     HiveTableScan(table=[[default, t]], table:alias=[t]) {code}
> h1. Problem Analysis
> At 
> [HiveFilterSetOpTransposeRule.java#L112|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L112]
>  the _RelMetadataQuery::getPulledUpPredicates_ method infers the following 
> predicate due to the CAST(NULL) in the projection:
> {code:java}
> (=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")){code}
> When the CAST is simplified to the NULL literal, the IS_NULL($1) predicate is 
> inferred.
> In 
> [HiveFilterSetOpTransposeRule.java#L114-L122|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L122],
>  the rule checks if the conjunction of the predicate coming from the filter 
> (here =(CAST($0):DOUBLE, 1000)) and the inferred predicates is satisfiable or 
> not, under the _UnknownAsFalse_ semantics.
> To summarize, the following expression is simplified under the 
> _UnknownAsFalse_ semantics:
> {code:java}
> AND((=($1, CAST(null:NULL):VARCHAR(2147483647) CHARACTER SET "UTF-16LE")), 
> =(CAST($0):DOUBLE, 1000))
> {code}
> Under In such semantics, (=($1, CAST(null:NULL):...) evaluates to 
> {_}FALSE{_}, because no value is equal to NULL (even NULL itself), AND(FALSE, 
>  =(CAST($0):DOUBLE, 1000)) necessarily evaluates to _FALSE_ altogether, and 
> the UNION ALL operand is pruned.
> Only by chance, when _CAST(NULL)_ is simplified to _NULL,_ we avoid the 
> issue, due to the _IS_NULL($1)_ inferred predicate, see 
> [HiveRelMdPredicates.java#L153-L156|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/HiveRelMdPredicates.java#L153-L156]
>  for understanding how the NULL literal is treated differently during 
> predicate inference.
> _HiveRelMdPredicates_ should not use equality ('=') for nullable constant 
> expressions, but rather IS NOT DISTINCT FROM, as detailed in HIVE-26733, but 
> nonetheless the way simplification is done is not correct here, inferred 
> predicates should be used as "context", rather than been used in a 
> conjunctive expression, this usage does not conform with any of the similar 
> uses of simplification with inferred predicates (see the bottom of the 
> "Solution" section for examples and details).
> h1. Solution
> In order to correctly simplify a predicate and test if it's always false or 
> not, we should build RexSimplify with _predicates_ as the list of predicates 
> known to hold in the context. In this way, the different semantics are 
> correctly taken into account.
> The code at 
> [HiveFilterSetOpTransposeRule.java#L114-L121|https://github.com/apache/hive/blob/297f510d3b581c9d4079e42caa28aa84f8486012/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterSetOpTransposeRule.java#L114-L121]
>  should be replaced by the following:
> {code:java}
> final RexExecutor executor =
> Util.first(filterRel.getCluster().getPlanner().getExecutor(), 
> RexUtil.EXECUTOR);
> final RexSimplify simplify = new RexSimplify(rexBuilder, predicates, 
> executor);
> final RexNode x = simplify.simplifyUnknownAs(newCondition, 
> RexUnknownAs.FALSE);{code}
> This is in line with other uses of simplification, like in Calcite:
> [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L704-L705]
> You can see the above method used for different kind of rels, always in the 
> way identical to what this ticket advocates:filters:
> [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L152-L155]
> projects:
> [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L306-L310]
> joins:
> [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L380-L383]
> windows:
> [https://github.com/apache/calcite/blob/a0ce3275119f804959cda54d6e7a016ab893c359/core/src/main/java/org/apache/calcite/rel/rules/ReduceExpressionsRule.java#L567-L576]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to