[ 
https://issues.apache.org/jira/browse/SPARK-45846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048947#comment-18048947
 ] 

Dipanshu Pandey commented on SPARK-45846:
-----------------------------------------

I have successfully replicated this issue. The NAAJ (Null-Aware Anti Join)
optimization ignores the spark.sql.autoBroadcastJoinThreshold configuration,
forcing a broadcast join even when the threshold is set to -1 (broadcast
disabled).

Reproduction Steps:
1. Create tables with nullable columns
2. Set spark.sql.autoBroadcastJoinThreshold = -1
3. Set spark.sql.optimizeNullAwareAntiJoin = true
4. Execute a NOT IN subquery

Example Query:


{code:sql}
CREATE OR REPLACE TEMP VIEW t1 AS SELECT * FROM VALUES (1), (2), (null) AS t(a);
CREATE OR REPLACE TEMP VIEW t2 AS SELECT * FROM VALUES (1), (null) AS t(b);
SET spark.sql.autoBroadcastJoinThreshold = -1;
SET spark.sql.optimizeNullAwareAntiJoin = true;
SELECT * FROM t1 WHERE a NOT IN (SELECT b FROM t2);
{code}
 

Observed Behavior:
The query plan still shows BroadcastHashJoin despite threshold = -1:


{code}
+- BroadcastHashJoin [a#9], [b#11], LeftAnti, BuildRight, true
   :- LocalTableScan [a#9]
   +- BroadcastExchange HashedRelationBroadcastMode(...)
      +- LocalTableScan [b#11]
{code}

Expected Behavior:
When autoBroadcastJoinThreshold = -1, the planner should not choose
BroadcastHashJoin and should fall back to another join strategy (e.g.,
SortMergeJoin, or BroadcastNestedLoopJoin, although the latter still
broadcasts one side).

Root Cause:
The NAAJ pattern match in SparkStrategies.scala (lines 331-333)
unconditionally creates a BroadcastHashJoinExec without checking
canBroadcastBySize().
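
To make the root cause concrete, here is a minimal, self-contained Scala
sketch of the difference between an unconditional NAAJ plan and one guarded
by a size check. It is a toy model, not Spark code: JoinChoice,
fitsUnderThreshold, planNaajCurrent and planNaajGuarded are hypothetical
names, and the rule that a negative threshold disables broadcasting is my
assumption about how the size check behaves.

```scala
// Toy model only: every name here is a hypothetical stand-in, not a Spark class.
object NaajGuardSketch {
  sealed trait JoinChoice
  case object BroadcastHashJoin extends JoinChoice
  case object FallBackToOtherStrategies extends JoinChoice

  // Assumed size rule: a negative threshold disables size-based broadcasting;
  // otherwise the build side must not exceed the threshold.
  def fitsUnderThreshold(buildSideBytes: Long, thresholdBytes: Long): Boolean =
    thresholdBytes >= 0 && buildSideBytes <= thresholdBytes

  // Behavior described in this issue: the NAAJ case ignores both inputs
  // and always picks a broadcast hash join.
  def planNaajCurrent(buildSideBytes: Long, thresholdBytes: Long): JoinChoice =
    BroadcastHashJoin

  // Possible guarded variant: broadcast only when the size check passes,
  // otherwise let the remaining join strategies handle the query.
  def planNaajGuarded(buildSideBytes: Long, thresholdBytes: Long): JoinChoice =
    if (fitsUnderThreshold(buildSideBytes, thresholdBytes)) BroadcastHashJoin
    else FallBackToOtherStrategies

  def main(args: Array[String]): Unit = {
    val disabled = -1L                    // autoBroadcastJoinThreshold = -1
    val tenMiB = 10L * 1024 * 1024        // an arbitrary positive threshold
    println(planNaajCurrent(64L, disabled)) // still broadcasts
    println(planNaajGuarded(64L, disabled)) // falls back
    println(planNaajGuarded(64L, tenMiB))   // small build side: broadcasts
  }
}
```

Which strategy the fallback should actually select is a separate question
and is left open in this sketch.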

I will be working on a fix for this issue.

> spark.sql.optimizeNullAwareAntiJoin should respect 
> spark.sql.autoBroadcastJoinThreshold
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-45846
>                 URL: https://issues.apache.org/jira/browse/SPARK-45846
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Normally broadcast join can be disabled when users set 
> {{spark.sql.autoBroadcastJoinThreshold}} to -1. However this doesn't apply to 
> {{spark.sql.optimizeNullAwareAntiJoin}}:
> {code}
>       case j @ ExtractSingleColumnNullAwareAntiJoin(leftKeys, rightKeys) =>
>         Seq(joins.BroadcastHashJoinExec(leftKeys, rightKeys, LeftAnti, BuildRight,
>           None, planLater(j.left), planLater(j.right), isNullAwareAntiJoin = true))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
