Guram Savinov created SPARK-46516:
-------------------------------------

             Summary: autoBroadcastJoinThreshold compared to plan.statistics, 
not the table size
                 Key: SPARK-46516
                 URL: https://issues.apache.org/jira/browse/SPARK-46516
             Project: Spark
          Issue Type: Documentation
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: Guram Savinov


From the docs: spark.sql.autoBroadcastJoinThreshold - "Configures the maximum 
size in bytes for a table that will be broadcasted to all worker nodes when 
performing a join."
In fact, Spark compares plan.statistics.sizeInBytes, i.e. the size estimate of 
the plan after column pruning (only the columns selected for the join), not 
the size of the table itself. The broadcasted table can therefore still be 
huge and lead to OOM on the driver, so the spark.sql.autoBroadcastJoinThreshold 
parameter seems useless when it is not compared against the actual size of the 
broadcasted table.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

Related topic on SO: 
https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s
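
Minimal sketch (not part of the report; the local path and all column names 
are made up) showing how to inspect the estimate that actually gets compared 
to the threshold after column pruning:

{code:scala}
// Sketch only: the value compared to spark.sql.autoBroadcastJoinThreshold is
// the optimized plan's stats.sizeInBytes after column pruning, not the size
// of the underlying table. /tmp/wide_table and the column names are made up.
import org.apache.spark.sql.SparkSession

object BroadcastThresholdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("autoBroadcastJoinThreshold-demo")
      // default threshold: 10 MB
      .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      .getOrCreate()

    // A "wide" table: a join key plus large payload columns.
    spark.range(0, 1000000)
      .selectExpr(
        "id AS key",
        "repeat('x', 200) AS payload1",
        "repeat('x', 200) AS payload2")
      .write.mode("overwrite").parquet("/tmp/wide_table")

    val facts = spark.range(0, 1000000).selectExpr("id AS key", "id * 2 AS value")

    // Only the join key is read from the wide table, so after column pruning
    // the default estimator scales the size estimate down by the output row
    // width relative to the full row width.
    val keysOnly = spark.read.parquet("/tmp/wide_table").select("key")
    println("estimate compared to the threshold: " +
      keysOnly.queryExecution.optimizedPlan.stats.sizeInBytes + " bytes")

    // The join strategy is driven by the pruned estimate printed above,
    // not by the total size of the files behind /tmp/wide_table.
    facts.join(keysOnly, "key").explain()

    spark.stop()
  }
}
{code}

Printing the pruned estimate next to the physical plan makes it visible why a 
table whose files are far above the threshold can still end up broadcast.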


