[ 
https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jack Chen updated SPARK-43413:
------------------------------
    Description: 
IN subquery expressions are incorrectly always marked as non-nullable, even 
when they are actually nullable. They correctly check the nullability of the 
left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is 
currently defined with nullability = false always. This is incorrect and can 
lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN 
expression returns NULL when the nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, transforming the expression to non_nullable_col IN (select nullable_col) 
, which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.

  was:
IN subquery expressions currently are marked as nullable if and only if the 
left-hand-side is nullable - because the right-hand-side of a IN subquery, the 
ListQuery, is currently defined with nullability = false always. This is 
incorrect and can lead to incorrect query transformations.

Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN 
expression returns NULL when the nullable_col is null, but our code marks it as 
non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> 
TRUE, transforming the expression to non_nullable_col IN (select nullable_col) 
, which is an incorrect transformation because NULL values of nullable_col now 
cause the expression to yield NULL instead of FALSE.

This bug can potentially lead to wrong results, but in most cases this doesn't 
directly cause wrong results end-to-end, because IN subqueries are almost 
always transformed to semi/anti/existence joins in RewritePredicateSubquery, 
and this rewrite can also incorrectly discard NULLs, which is another bug. But 
we can observe it causing wrong behavior in unit tests, and it could easily 
lead to incorrect query results if there are changes to the surrounding 
context, so it should be fixed regardless.

This is a long-standing bug that has existed at least since 2016, as long as 
the ListQuery class has existed.


> IN subquery ListQuery has wrong nullability
> -------------------------------------------
>
>                 Key: SPARK-43413
>                 URL: https://issues.apache.org/jira/browse/SPARK-43413
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Jack Chen
>            Priority: Major
>
> IN subquery expressions are incorrectly always marked as non-nullable, even 
> when they are actually nullable. They correctly check the nullability of the 
> left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is 
> currently defined with nullability = false always. This is incorrect and can 
> lead to incorrect query transformations.
> Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN 
> expression returns NULL when the nullable_col is null, but our code marks it 
> as non-nullable, and therefore SimplifyBinaryComparison transforms away the 
> <=> TRUE, transforming the expression to non_nullable_col IN (select 
> nullable_col) , which is an incorrect transformation because NULL values of 
> nullable_col now cause the expression to yield NULL instead of FALSE.
> This bug can potentially lead to wrong results, but in most cases this 
> doesn't directly cause wrong results end-to-end, because IN subqueries are 
> almost always transformed to semi/anti/existence joins in 
> RewritePredicateSubquery, and this rewrite can also incorrectly discard 
> NULLs, which is another bug. But we can observe it causing wrong behavior in 
> unit tests, and it could easily lead to incorrect query results if there are 
> changes to the surrounding context, so it should be fixed regardless.
> This is a long-standing bug that has existed at least since 2016, as long as 
> the ListQuery class has existed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to