Zhidong Qu created SPARK-56977:
----------------------------------

             Summary: Respect User Specified JoinType for NEAREST BY Join
                 Key: SPARK-56977
                 URL: https://issues.apache.org/jira/browse/SPARK-56977
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Zhidong Qu


The original implementation of NEAREST BY JOIN hardcoded the synthetic join to
`LeftOuter`, on the grounds that `LEFT OUTER` and `INNER` are equivalent for an
unconditioned join when the right side is non-empty, and that
`Generate(outer = false)` would drop the unwanted rows for `INNER` when the
right side is empty. While the results are correct, this approach:
 * Prevents `INNER NEAREST BY` from being planned as a Cartesian product.
Spark's join strategy picks `CartesianProductExec` only for `Inner` joins with
no condition; an unconditioned `LeftOuter` join falls back to
`BroadcastNestedLoopJoin`, which tries to broadcast the right side. When the
right relation is large, the broadcast either OOMs or exceeds
`spark.sql.autoBroadcastJoinThreshold`, leaving the planner with no good
option. `CartesianProductExec` partitions both sides and streams pairs, so it
scales naturally with right-side size. Respecting the user's `INNER` join type
re-enables this strategy for the common `INNER NEAREST BY` case.
 * Makes the EXPLAIN output misleading for `INNER NEAREST BY` (shows 
`LeftOuter` in the underlying join even though the user wrote `INNER`).
 * Produces a row per left input when `right` is empty under `INNER`, only to 
drop those rows later via `Generate(outer = false)` and the `size(matches) > 0` 
filter -- extra work that respecting `joinType` avoids at the source.
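
The equivalence argument and the empty-right-side case above can be checked
with a small toy model. This is only a sketch in plain Python, not Spark code:
`unconditioned_join` and the sample rows are hypothetical names made up for
illustration, and rows are modeled as simple list elements.

```python
from itertools import product


def unconditioned_join(left, right, join_type):
    """Toy model of a join with no join condition (hypothetical helper)."""
    if join_type == "inner":
        # Every left row pairs with every right row; an empty right yields nothing.
        return list(product(left, right))
    if join_type == "left_outer":
        if not right:
            # No match exists, so each left row is kept and padded with None.
            return [(l, None) for l in left]
        # With no condition, every pair matches, so nothing needs padding.
        return list(product(left, right))
    raise ValueError(f"unsupported join type: {join_type}")


left_rows = ["a", "b", "c"]
right_rows = [1, 2]

# Non-empty right side: both join types produce the same pairs.
same_when_non_empty = (
    unconditioned_join(left_rows, right_rows, "inner")
    == unconditioned_join(left_rows, right_rows, "left_outer")
)

# Empty right side: inner produces no rows, left outer produces one per left row.
inner_rows_empty_right = unconditioned_join(left_rows, [], "inner")
outer_rows_empty_right = unconditioned_join(left_rows, [], "left_outer")

print(same_when_non_empty, len(inner_rows_empty_right), len(outer_rows_empty_right))
```

In this model the padded rows from the left outer variant are exactly the ones
a later step would have to discard for `INNER`, which is the extra work the
last bullet describes.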



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

