sunchao opened a new pull request, #55927:
URL: https://github.com/apache/spark/pull/55927

   ### What changes were proposed in this pull request?
   
   This PR adds null-aware shuffle partitioning for ordinary outer equi-joins 
so that rows with unmatched `NULL` join keys can be spread across reducers 
instead of being concentrated in a single shuffle partition.
   
   The change keeps normal hash partitioning for non-NULL keys, applies only to 
ordinary `LEFT OUTER`, `RIGHT OUTER`, and `FULL OUTER` joins, and leaves 
NULL-safe equality (`<=>`) unchanged because NULL keys can match each other there.
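
   As a rough illustration of the rule described above, the following 
plain-Python sketch picks a shuffle partition for a single row. It is not the 
code in this PR: `choose_partition`, `SPREADABLE_JOIN_TYPES`, the `row_id` 
argument, and the `row_id % num_partitions` spreading rule are all hypothetical 
names and choices made for the example, and `zlib.crc32` stands in for Spark's 
real hash function.

```python
import zlib
from typing import Optional

# Join types for which the sketch spreads NULL keys (assumption for this example).
SPREADABLE_JOIN_TYPES = {"LEFT OUTER", "RIGHT OUTER", "FULL OUTER"}


def choose_partition(key: Optional[str], row_id: int, num_partitions: int,
                     join_type: str, null_safe: bool) -> int:
    """Return the reducer index for one row (hypothetical helper)."""
    spread = join_type in SPREADABLE_JOIN_TYPES and not null_safe
    if key is None and spread:
        # A NULL key cannot match under `=`, so any reducer may emit the row.
        # Deriving the target from a stable row id keeps retries deterministic.
        return row_id % num_partitions
    if key is None:
        # NULL-safe equality or a non-outer join: keep all NULLs together.
        return 0
    # Non-NULL keys keep ordinary hash partitioning so equal keys meet.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

   With four partitions, NULL-key rows of a `FULL OUTER` join under `=` are 
spread over all four reducers in this sketch, while the same rows under `<=>` 
stay on a single reducer.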
   
   The implementation also wires the new partitioning through shuffle 
compatibility checks and AQE/coalesced shuffle reads, and expands test coverage 
for join planning, result correctness, and deterministic retry behavior.
   
   ### Why are the changes needed?
   
   For ordinary outer equi-joins, rows with `NULL` join keys cannot match under 
`=` semantics, but Spark's hash partitioning still sends all of those rows to 
the same shuffle partition. Large NULL-heavy outer-side inputs can therefore 
create avoidable reducer skew without any semantic benefit.
   
   Spreading only these semantically unmatched NULL-key rows reduces that skew 
pattern while preserving the join behavior for regular keys and for NULL-safe 
equality.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Query results are unchanged, but shuffle partitioning for ordinary 
NULL-heavy outer equi-joins becomes less skewed. For example, in a `LEFT OUTER 
JOIN` where the left side contains many `NULL` keys and the predicate is 
`left.k = right.k`, those rows no longer need to converge on the same reducer.
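
   The claim that results stay the same while skew drops can be checked with a 
toy partitioned join. The sketch below is plain Python, not Spark code: 
`hash_partition`, `left_outer_join`, the `spread_nulls` flag, and the 
index-based spreading rule are hypothetical and exist only for this example. 
It assumes the right side keeps ordinary hash partitioning.

```python
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    # Plain hash partitioning: every NULL (None) key lands in one partition.
    if key is None:
        return 0
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def left_outer_join(left, right, num_partitions, spread_nulls):
    """Partitioned LEFT OUTER JOIN on `left.k = right.k` over (key, value) rows."""
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for i, (k, v) in enumerate(left):
        if k is None and spread_nulls:
            p = i % num_partitions  # illustrative deterministic spreading rule
        else:
            p = hash_partition(k, num_partitions)
        left_parts[p].append((k, v))
    for k, v in right:
        right_parts[hash_partition(k, num_partitions)].append((k, v))

    out = []
    for lp, rp in zip(left_parts, right_parts):
        for lk, lv in lp:
            # Under `=` semantics a NULL key never matches anything.
            matches = [rv for rk, rv in rp if lk is not None and lk == rk]
            if matches:
                out.extend((lk, lv, rv) for rv in matches)
            else:
                out.append((lk, lv, None))
    return out, [len(p) for p in left_parts]


left = [(None, f"l{i}") for i in range(8)] + [("a", "la"), ("b", "lb")]
right = [("a", "ra"), (None, "rn")]
plain_rows, plain_sizes = left_outer_join(left, right, 4, spread_nulls=False)
spread_rows, spread_sizes = left_outer_join(left, right, 4, spread_nulls=True)
```

   Both runs produce the same multiset of output rows, but the largest 
left-side partition is smaller when the NULL-key rows are spread.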
   
   ### How was this patch tested?
   
   - Added and updated unit tests covering ordinary outer joins, FULL OUTER 
JOIN result correctness with NULL keys, exclusion of NULL-safe equality, 
shuffle-level NULL spreading, and deterministic retry behavior.
   - Ran `git diff --cached --check` before commit.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex GPT-5


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

