Zhidong Qu created SPARK-56977:
----------------------------------
Summary: Respect User Specified JoinType for NEAREST BY Join
Key: SPARK-56977
URL: https://issues.apache.org/jira/browse/SPARK-56977
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.2.0
Reporter: Zhidong Qu
The original implementation of NEAREST BY JOIN hardcoded the synthetic join to
`LeftOuter` and justified it on the grounds that `LEFT OUTER` and `INNER` are
equivalent for an unconditioned join when the right side is non-empty, and
`Generate(outer = false)` would drop unwanted rows for INNER when right is
empty. While correct in result, this:
* `INNER NEAREST BY` cannot be planned as a Cartesian product. Spark's
strategy picks `CartesianProductExec` only for `Inner` joins with no condition;
an unconditioned `LeftOuter` join falls back to `BroadcastNestedLoopJoin`,
which tries to broadcast the right side. When the right relation is large, the
broadcast either OOMs or exceeds `spark.sql.autoBroadcastJoinThreshold` and the
planner is left with no good option. `CartesianProductExec` partitions both
sides and streams pairs, so it scales naturally with right-side size.
Respecting the user's `INNER` join type re-enables this strategy for the common
`INNER NEAREST BY` case.
* Makes the EXPLAIN output misleading for `INNER NEAREST BY` (shows
`LeftOuter` in the underlying join even though the user wrote `INNER`).
* Produces a row per left input when `right` is empty under `INNER`, only to
drop those rows later via `Generate(outer = false)` and the `size(matches) > 0`
filter -- extra work that respecting `joinType` avoids at the source.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]