nathanb9 opened a new pull request, #22810: URL: https://github.com/apache/datafusion/pull/22810
## Problem `NOT IN (subquery)` is a null-aware anti join: when the subquery yields a NULL the predicate is never TRUE, so the query must return zero rows. With `prefer_hash_join = false` and multiple partitions, the planner routed the null-aware anti join to `SortMergeJoinExec`, which is not null-aware, so it returned wrong results. HashJoin (the default) was already correct. ## Proof ```sql set datafusion.optimizer.prefer_hash_join = false; create table t1(x int) as values (1); create table t2(y int) as values (NULL); select x from t1 where x not in (select y from t2); ``` Expected 0 rows (the subquery contains a NULL). Before this change it returned `1`. With `prefer_hash_join = true` it correctly returned 0 rows. `EXPLAIN` showed the wrong config selecting `SortMergeJoinExec: join_type=LeftAnti`. ## Solution The planner already requires null-aware joins to use the CollectLeft HashJoin, and the HashJoin branch guards on `!null_aware`. The SortMergeJoin branch was missing the same guard, so this adds `&& !*null_aware` to it. Null-aware anti joins now fall through to the CollectLeft HashJoin regardless of `prefer_hash_join`. `SortMergeJoinExec` has no `null_aware` parameter and cannot honor these semantics. Added a regression test in `subquery.slt` (under `prefer_hash_join = false`) covering both a null-containing subquery (zero rows) and a null-free subquery (normal anti join). All 61 SortMergeJoin unit tests pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
