[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

Guram Savinov (Jira) Wed, 27 Dec 2023 07:06:05 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guram Savinov updated SPARK-46516:
----------------------------------
    Description: 
>From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
>size in bytes for a table that will be broadcasted to all worker nodes when 
>performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and it is loaded 
entirely into drivers memory which can lead to OOM.

spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

 

Original task and test when autobroadcast compared to relation totalSize:

https://issues.apache.org/jira/browse/SPARK-2393

[https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]

 

Task and PR where autoBroadcastJoinThreshold sterted to be compared to project 
of a plan instead of relations:

https://issues.apache.org/jira/browse/SPARK-13329

[https://github.com/apache/spark/pull/11210]

 

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

  was:
>From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
>size in bytes for a table that will be broadcasted to all worker nodes when 
>performing a join.

[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
join, not a table size.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

Join can select only a few columns and sizeInBytes will be lesser than 
autoBroadcastJoinThreshold, but broadcasted table can be huge and it is loaded 
entirely into drivers memory which can lead to OOM.

spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
compared to  broadcasted table size.

 

Original task and test when autobroadcast compared to relation totalSize:

https://issues.apache.org/jira/browse/SPARK-2393

[https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]

 

Related topic on SO: 
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]

 


> autoBroadcastJoinThreshold compared to plan.statistics not a table size
> -----------------------------------------------------------------------
>
>                 Key: SPARK-46516
>                 URL: https://issues.apache.org/jira/browse/SPARK-46516
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: Guram Savinov
>            Priority: Major
>
> From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum 
> size in bytes for a table that will be broadcasted to all worker nodes when 
> performing a join.
> [https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]
> In fact Spark compares plan.statistics.sizeInBytes for columns selected in 
> join, not a table size.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]
> Join can select only a few columns and sizeInBytes will be lesser than 
> autoBroadcastJoinThreshold, but broadcasted table can be huge and it is 
> loaded entirely into drivers memory which can lead to OOM.
> spark.sql.autoBroadcastJoinThreshold parameter seems useless when its not 
> compared to  broadcasted table size.
>  
> Original task and test when autobroadcast compared to relation totalSize:
> https://issues.apache.org/jira/browse/SPARK-2393
> [https://github.com/apache/spark/pull/1238/files#diff-00485e6cae519f81adca5ceee66227c6eae35db709619d505468f8765175ac18R39]
>  
> Task and PR where autoBroadcastJoinThreshold sterted to be compared to 
> project of a plan instead of relations:
> https://issues.apache.org/jira/browse/SPARK-13329
> [https://github.com/apache/spark/pull/11210]
>  
> Related topic on SO: 
> [https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-46516) autoBroadcastJoinThreshold compared to plan.statistics not a table size

Reply via email to