[
https://issues.apache.org/jira/browse/SPARK-56767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
James Xu resolved SPARK-56767.
------------------------------
Resolution: Invalid
The description is not very accurate, will close it and open a new one.
> Optimize hot-key skew joins by splitting into a broadcast branch for hot keys
> and a sort-merge branch for the rest
> ------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-56767
> URL: https://issues.apache.org/jira/browse/SPARK-56767
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 4.1.1
> Reporter: James Xu
> Priority: Major
> Attachments: image-2026-05-08-17-11-58-881.png
>
>
> h2. What is the problem?
> Data skew in shuffle joins is one of the most common causes of long-running
> or stalled Spark jobs. When a small number of join key values ("hot keys")
> account for a disproportionate share of rows on one side of a join, the
> shuffle partitions handling those keys become stragglers — they process far
> more data than average partitions, dominating overall job latency regardless
> of cluster size.
> Spark's AQE-based skew join optimization (OptimizeSkewedJoin) detects skewed
> partitions at runtime and splits them by reading from multiple map output
> ranges. *However, this approach has a fundamental limitation: if skew
> originates from a small number of high-frequency key values that dominate
> entire map-side output ranges — for example when the input data is pre-sorted
> or clustered by key, causing all rows for a hot key to land in contiguous map
> output — those rows cannot be split across tasks regardless of AQE
> thresholds.* AQE has no visibility into key-level frequency, only
> partition-level byte sizes, so it cannot route hot-key rows to a broadcast
> join where they could be processed efficiently.
> h2. Why does this deserve solving?
> Hot-key skew is a well-known and frequently reported production problem. The
> standard workaround — salting the join key — requires rewriting the query,
> adds complexity, and is error-prone. A hint-driven, transparent optimization
> that requires no query rewriting would meaningfully reduce the operational
> burden for users working with skewed datasets (e.g. joins involving user IDs,
> product IDs, or any high-cardinality column with a power-law distribution).
> The optimization is also naturally bounded in scope: it only fires when the
> user explicitly opts in via a hint, so there is no risk of silent regression
> for existing queries.
> h2. How are we planning to solve this?
> We introduce a new SKEW_JOIN(tableName(col1, col2, ...)) hint that triggers a
> hot-key broadcast split at query planning time:
> 1. Hot-key sampling. At planning time, Spark executes a lightweight sampling
> subquery against the hinted (skewed) table. The subquery groups by the hinted
> columns, counts occurrences per key, and returns the top-N keys whose count
> exceeds a configurable threshold (scaled by the sampling fraction). Sampling
> is done via TABLESAMPLE to limit overhead on large tables.
> 2. Plan split. If hot keys are found, the join is split into two branches:
> - Hot branch: A BroadcastHashJoinExec that processes only rows whose join
> key matches a hot key. The non-skewed side is broadcast after being filtered
> to matching hot-key rows.
> - Normal branch: A SortMergeJoinExec that processes all remaining rows
> (skewed side filtered to exclude hot keys).
> - The results of both branches are combined with a union.
> 3. Fallback. If sampling finds no hot keys, or if a type mismatch prevents
> safe broadcast filtering, the operator transparently falls back to a standard
> SortMergeJoinExec with no change to query results.
> The feature supports inner, left outer, right outer, left semi, and left anti
> joins. It is compatible with AQE. NULL join keys are always routed through
> the normal branch.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]