Chao Sun created SPARK-56899:
--------------------------------
Summary: Spread NULL outer join keys across shuffle partitions
Key: SPARK-56899
URL: https://issues.apache.org/jira/browse/SPARK-56899
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Chao Sun
Spark currently hashes all join keys, including NULL keys, when planning
shuffle exchanges for ordinary outer equi-joins. For rows whose join key is
NULL, ordinary equality predicates such as left.k = right.k can never match.
However, those rows still hash to the same shuffle partition, which can create
avoidable reducer skew when an outer-side input contains many NULL keys.
Why this should change:
- Non-NULL keys still need normal hash co-location so matching rows meet in the
same reducer.
- NULL-key rows under ordinary equality are semantically unmatched, so they do
not need to be co-located with each other.
- Spreading only those NULL-key rows across shuffle partitions preserves query
semantics while reducing a common skew pattern for outer joins.
Examples:
1. LEFT OUTER JOIN
Suppose the left side contains millions of rows where k IS NULL, and the
join predicate is left.k = right.k. Those rows can never match the right side,
but today they are funneled into the same shuffle partition. They should be
spread across partitions instead.
2. FULL OUTER JOIN
If both sides contain many NULL keys and the join predicate is ordinary
equality, NULL-key rows on both sides remain unmatched. They likewise do not
need to gather in one reducer.
3. NULL-safe equality must stay unchanged
For predicates such as left.k <=> right.k, NULL values can match. Those rows
must remain co-located, so this optimization must not apply.
The intended behavior is to preserve standard hash partitioning for non-NULL
keys while using a null-aware shuffle partitioning strategy for ordinary LEFT
OUTER, RIGHT OUTER, and FULL OUTER joins. The change also needs to remain
compatible with AQE/coalesced shuffle reads and keep retry behavior
deterministic for the null-spreading path.
Test coverage should include:
- ordinary outer joins using the optimization,
- FULL OUTER JOIN result correctness with NULL keys,
- NULL-safe equality remaining excluded,
- direct shuffle-level coverage that NULL-key rows are spread across reducers,
- deterministic/retry behavior for the null-aware shuffle path.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]