nathanb9 opened a new pull request, #22810:
URL: https://github.com/apache/datafusion/pull/22810

   ## Problem
   
   `NOT IN (subquery)` is a null-aware anti join: when the subquery yields a 
NULL the predicate is never TRUE, so the query must return zero rows. With 
`prefer_hash_join = false` and multiple partitions, the planner routed the 
null-aware anti join to `SortMergeJoinExec`, which is not null-aware, so it 
returned wrong results. HashJoin (the default) was already correct.
   
   ## Proof
   
   ```sql
   set datafusion.optimizer.prefer_hash_join = false;
   create table t1(x int) as values (1);
   create table t2(y int) as values (NULL);
   select x from t1 where x not in (select y from t2);
   ```
   
   Expected 0 rows (the subquery contains a NULL). Before this change it 
returned `1`. With `prefer_hash_join = true` it correctly returned 0 rows. 
`EXPLAIN` showed the wrong config selecting `SortMergeJoinExec: 
join_type=LeftAnti`.
   
   ## Solution
   
   The planner already requires null-aware joins to use the CollectLeft 
HashJoin, and the HashJoin branch guards on `!null_aware`. The SortMergeJoin 
branch was missing the same guard, so this adds `&& !*null_aware` to it. 
Null-aware anti joins now fall through to the CollectLeft HashJoin regardless 
of `prefer_hash_join`. `SortMergeJoinExec` has no `null_aware` parameter and 
cannot honor these semantics.
   
   Added a regression test in `subquery.slt` (under `prefer_hash_join = false`) 
covering both a null-containing subquery (zero rows) and a null-free subquery 
(normal anti join). All 61 SortMergeJoin unit tests pass.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to