Chao Sun created SPARK-57282:
--------------------------------
Summary: Spread NULL left anti join keys across shuffle partitions
Key: SPARK-57282
URL: https://issues.apache.org/jira/browse/SPARK-57282
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Chao Sun
Assignee: Chao Sun
SPARK-56903 added feature-flagged NULL-key spreading for shuffled outer
equi-joins. The review discussion explicitly identified ordinary LEFT ANTI
joins as a follow-up because the same structural condition applies: left-side
rows with NULL keys cannot match under `=`, but they must be emitted by the
anti join. Hash partitioning currently sends all of those rows to one reducer,
causing avoidable skew on NULL-heavy inputs.
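The skew described above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's partitioning code: `hash_partition` is a hypothetical stand-in for hash partitioning, the choice of partition 0 for NULL is arbitrary, and the row counts are invented for illustration.

```python
from collections import Counter

NUM_PARTITIONS = 8

def hash_partition(key, num_partitions=NUM_PARTITIONS):
    """Toy stand-in for hash partitioning: equal keys always share a partition."""
    # NULL (None) is treated like any other constant key, so every NULL-keyed
    # row lands on the same reducer. Partition 0 is an arbitrary choice here.
    if key is None:
        return 0
    return key % num_partitions

# Hypothetical NULL-heavy left side: 1,000 distinct keys plus 9,000 NULL keys.
left_keys = list(range(1000)) + [None] * 9000

# Count how many rows each reducer would receive.
sizes = Counter(hash_partition(k) for k in left_keys)
largest = max(sizes.values())
smallest = min(sizes.values())
```

In this model the non-NULL keys spread evenly, while the single partition that receives the NULL keys holds the bulk of the rows.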
Extend `spark.sql.shuffle.spreadNullJoinKeys.enabled` to ordinary shuffled LEFT
ANTI equi-joins when the preserved left-side join keys are nullable. Non-NULL
keys should retain normal hash partitioning, while NULL-keyed rows can use the
existing null-aware shuffle layout. LEFT SEMI and other existence joins remain
out of scope because they do not emit unmatched left rows.
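A minimal sketch of the idea in plain Python, assuming a round-robin spread for NULL keys (the actual null-aware layout from SPARK-56903 may differ). All function names here are hypothetical and none of this is Spark API. The sketch checks that spreading NULL-keyed left rows does not change the LEFT ANTI result, since those rows never match under `=` and are emitted regardless of which reducer sees them.

```python
from collections import Counter
from itertools import count

NUM_PARTITIONS = 4

def null_aware_partition(key, null_counter, num_partitions=NUM_PARTITIONS):
    """Hypothetical partitioner: hash non-NULL keys, spread NULL keys round-robin."""
    if key is None:
        return next(null_counter) % num_partitions
    return key % num_partitions

def left_anti_join(left_keys, right_keys):
    """Ordinary LEFT ANTI under `=`: emit left rows that have no match on the right."""
    right = {k for k in right_keys if k is not None}
    # A NULL left key can never match, so it is always emitted.
    return [k for k in left_keys if k is None or k not in right]

def shuffle(keys, partition_fn):
    """Route each key to the partition chosen by partition_fn."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for k in keys:
        parts[partition_fn(k)].append(k)
    return parts

left = [1, 2, 3, None, None, None, None, 5, None, None, None, None]
right = [2, 3, 4, None]

nulls = count()
left_parts = shuffle(left, lambda k: null_aware_partition(k, nulls))
# The right side keeps plain hash partitioning; its NULL rows match nothing.
right_parts = shuffle(right, lambda k: 0 if k is None else k % NUM_PARTITIONS)

# Join each partition pair independently, then union the results.
partitioned = [row
               for l, r in zip(left_parts, right_parts)
               for row in left_anti_join(l, r)]
expected = left_anti_join(left, right)
nulls_per_partition = [p.count(None) for p in left_parts]
```

Comparing `Counter(partitioned)` with `Counter(expected)` confirms that the per-partition join yields the same multiset of rows as the unpartitioned join, while `nulls_per_partition` shows the NULL-keyed rows divided evenly instead of piling onto one reducer. The same trick would not help LEFT SEMI, which emits only matched rows and therefore never outputs NULL-keyed left rows.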
Add correctness and physical-partitioning coverage for sort-merge join and
shuffled hash join, plus AQE coverage for coalesced null-aware partitioning.
Related discussion:
https://github.com/apache/spark/pull/55927#discussion_r3261128231