[ https://issues.apache.org/jira/browse/SPARK-43413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jack Chen updated SPARK-43413: ------------------------------ Description: IN subquery expressions are incorrectly marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. was: IN subquery expressions are incorrectly always marked as non-nullable, even when they are actually nullable. They correctly check the nullability of the left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is currently defined with nullability = false always. This is incorrect and can lead to incorrect query transformations. Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN expression returns NULL when the nullable_col is null, but our code marks it as non-nullable, and therefore SimplifyBinaryComparison transforms away the <=> TRUE, transforming the expression to non_nullable_col IN (select nullable_col) , which is an incorrect transformation because NULL values of nullable_col now cause the expression to yield NULL instead of FALSE. This bug can potentially lead to wrong results, but in most cases this doesn't directly cause wrong results end-to-end, because IN subqueries are almost always transformed to semi/anti/existence joins in RewritePredicateSubquery, and this rewrite can also incorrectly discard NULLs, which is another bug. But we can observe it causing wrong behavior in unit tests, and it could easily lead to incorrect query results if there are changes to the surrounding context, so it should be fixed regardless. This is a long-standing bug that has existed at least since 2016, as long as the ListQuery class has existed. > IN subquery ListQuery has wrong nullability > ------------------------------------------- > > Key: SPARK-43413 > URL: https://issues.apache.org/jira/browse/SPARK-43413 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.4.0 > Reporter: Jack Chen > Priority: Major > > IN subquery expressions are incorrectly marked as non-nullable, even when > they are actually nullable. They correctly check the nullability of the > left-hand-side, but the right-hand-side of a IN subquery, the ListQuery, is > currently defined with nullability = false always. This is incorrect and can > lead to incorrect query transformations. > Example: (non_nullable_col IN (select nullable_col)) <=> TRUE . Here the IN > expression returns NULL when the nullable_col is null, but our code marks it > as non-nullable, and therefore SimplifyBinaryComparison transforms away the > <=> TRUE, transforming the expression to non_nullable_col IN (select > nullable_col) , which is an incorrect transformation because NULL values of > nullable_col now cause the expression to yield NULL instead of FALSE. > This bug can potentially lead to wrong results, but in most cases this > doesn't directly cause wrong results end-to-end, because IN subqueries are > almost always transformed to semi/anti/existence joins in > RewritePredicateSubquery, and this rewrite can also incorrectly discard > NULLs, which is another bug. But we can observe it causing wrong behavior in > unit tests, and it could easily lead to incorrect query results if there are > changes to the surrounding context, so it should be fixed regardless. > This is a long-standing bug that has existed at least since 2016, as long as > the ListQuery class has existed. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org