sunchao opened a new pull request, #55927: URL: https://github.com/apache/spark/pull/55927
### What changes were proposed in this pull request?

This PR adds null-aware shuffle partitioning for ordinary outer equi-joins so that rows with unmatched `NULL` join keys can be spread across reducers instead of being concentrated in a single shuffle partition.

The change keeps normal hash partitioning for non-NULL keys, applies only to ordinary `LEFT OUTER`, `RIGHT OUTER`, and `FULL OUTER` joins, and leaves null-safe equality (`<=>`) unchanged because NULLs may match there.

The implementation also wires the new partitioning through shuffle compatibility checks and AQE/coalesced shuffle reads, and expands test coverage for join planning, result correctness, and deterministic retry behavior.

### Why are the changes needed?

For ordinary outer equi-joins, rows with `NULL` join keys cannot match under `=` semantics, but Spark still hashes those rows together during shuffle planning. Large NULL-heavy outer-side inputs can therefore create avoidable reducer skew without any semantic benefit.

Spreading only these semantically unmatched NULL-key rows reduces that skew pattern while preserving the join behavior for regular keys and for NULL-safe equality.

### Does this PR introduce _any_ user-facing change?

Yes. Query results are unchanged, but shuffle partitioning for ordinary NULL-heavy outer equi-joins becomes less skewed. For example, in a `LEFT OUTER JOIN` where the left side contains many `NULL` keys and the predicate is `left.k = right.k`, those rows no longer need to converge on the same reducer.

### How was this patch tested?

- Added and updated unit tests covering ordinary outer joins, FULL OUTER JOIN result correctness with NULL keys, exclusion of NULL-safe equality, shuffle-level NULL spreading, and deterministic retry behavior.
- Ran `git diff --cached --check` before commit.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5
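The skew pattern described in this PR can be illustrated with a small standalone simulation. This is only a sketch and not the code from the PR: it is plain Python rather than Spark, `zlib.crc32` stands in for Spark's hash function, and the names `plain_partition`, `null_aware_partition`, `row_index`, and `map_task_id` are hypothetical. It assumes NULL keys are spread by a deterministic function of the row's position in the map task, so that a retried task would reproduce the same assignment; the PR may use a different scheme.

```python
import zlib
from collections import Counter
from typing import Hashable, List, Optional


def hash_partition(key: Hashable, num_partitions: int) -> int:
    # crc32 is a stand-in for a stable hash; Spark uses its own hash function.
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def plain_partition(key: Optional[Hashable], num_partitions: int) -> int:
    # Baseline: every NULL key lands in one partition (0 here, by assumption).
    if key is None:
        return 0
    return hash_partition(key, num_partitions)


def null_aware_partition(
    key: Optional[Hashable],
    row_index: int,
    num_partitions: int,
    map_task_id: int = 0,
) -> int:
    # NULL keys cannot match under `=`, so they may go to any reducer.
    # Assumption: spread them round-robin by position, which is deterministic
    # as long as the map task replays its rows in the same order.
    if key is None:
        return (map_task_id + row_index) % num_partitions
    # Non-NULL keys keep ordinary hash partitioning so matches still co-locate.
    return hash_partition(key, num_partitions)


def partition_sizes(
    keys: List[Optional[Hashable]], num_partitions: int, null_aware: bool
) -> List[int]:
    # Count how many rows each reducer partition would receive.
    sizes: Counter = Counter()
    for row_index, key in enumerate(keys):
        if null_aware:
            p = null_aware_partition(key, row_index, num_partitions)
        else:
            p = plain_partition(key, num_partitions)
        sizes[p] += 1
    return [sizes[p] for p in range(num_partitions)]


# A NULL-heavy outer side: 800 NULL keys followed by 80 distinct non-NULL keys.
keys: List[Optional[Hashable]] = [None] * 800 + list(range(80))
before = partition_sizes(keys, 8, null_aware=False)
after = partition_sizes(keys, 8, null_aware=True)
print("plain hash:", before)
print("null-aware:", after)
```

In this toy setup the plain scheme sends all 800 NULL-key rows to one partition, while the null-aware variant gives each of the 8 partitions 100 of them. Non-NULL keys map to the same partition under both schemes, which is why equi-join matches are unaffected.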
