[ 
https://issues.apache.org/jira/browse/SPARK-56977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56977:
-----------------------------------
    Labels: pull-request-available  (was: )

> Respect User Specified JoinType for NEAREST BY Join
> ---------------------------------------------------
>
>                 Key: SPARK-56977
>                 URL: https://issues.apache.org/jira/browse/SPARK-56977
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Zhidong Qu
>            Priority: Major
>              Labels: pull-request-available
>
> The original implementation of NEAREST BY JOIN hardcoded the synthetic join 
> to `LeftOuter` and justified it on the grounds that `LEFT OUTER` and `INNER` 
> are equivalent for an unconditioned join when the right side is non-empty, 
> and `Generate(outer = false)` would drop unwanted rows for INNER when right 
> is empty. While correct in result, this:             
>  * `INNER NEAREST BY` cannot be planned as a Cartesian product. Spark's 
> strategy picks `CartesianProductExec` only for `Inner` joins with no 
> condition; an unconditioned `LeftOuter` join falls back to 
> `BroadcastNestedLoopJoin`, which tries to broadcast the right side. When the 
> right relation is large, the broadcast either OOMs or exceeds 
> `spark.sql.autoBroadcastJoinThreshold` and the planner is left with no good 
> option. `CartesianProductExec` partitions both sides and streams pairs, so it 
> scales naturally with right-side size. Respecting the user's `INNER` join 
> type re-enables this strategy for the common `INNER NEAREST BY` case.
>  * Makes the EXPLAIN output misleading for `INNER NEAREST BY` (shows 
> `LeftOuter` in the underlying join even though the user wrote `INNER`).
>  * Produces a row per left input when `right` is empty under `INNER`, only to 
> drop those rows later via `Generate(outer = false)` and the `size(matches) > 
> 0` filter -- extra work that respecting `joinType` avoids at the source.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to