[ https://issues.apache.org/jira/browse/SPARK-46516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guram Savinov updated SPARK-46516:
----------------------------------
Description:

From the docs: spark.sql.autoBroadcastJoinThreshold - Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.
[https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration]

In fact, Spark compares plan.statistics.sizeInBytes for the columns selected in the join, not the table size.
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L368]

A join may select only a few columns, so sizeInBytes can be less than autoBroadcastJoinThreshold even though the broadcast table itself is huge, which can lead to an OOM on the driver.

The spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared to the size of the broadcast table.

Related topic on SO:
[https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s]
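The mismatch described above can be sketched with a simplified model of the estimate: after a projection, the plan's sizeInBytes is scaled down to roughly the selected columns' share of the row width, and it is this shrunken number, not the table size, that gets compared against the threshold. This is an illustrative model only, not Spark's actual estimation code; all row counts, byte widths, and the threshold value below are invented for the example.

```python
# Illustrative model (NOT Spark's actual code) of why a projection can
# push a huge table's size estimate under autoBroadcastJoinThreshold.
# All numbers here are invented for the sketch.

NUM_ROWS = 500_000
FULL_ROW_BYTES = 100_000       # very wide rows, e.g. large string columns
SELECTED_ROW_BYTES = 16        # the join projects only two 8-byte columns

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the documented default

full_table_size = NUM_ROWS * FULL_ROW_BYTES      # ~50 GB on disk
projected_size = NUM_ROWS * SELECTED_ROW_BYTES   # ~8 MB after projection

# The planner compares the *projected* estimate to the threshold,
# so a 50 GB table can still be chosen for broadcast.
would_broadcast = projected_size <= AUTO_BROADCAST_THRESHOLD
print(full_table_size, projected_size, would_broadcast)
```

In this model the check passes (8 MB < 10 MB) even though the table is ~50 GB, which matches the report: the threshold never sees the full table size, and if the per-row estimate is optimistic the actual broadcast payload on the driver can be far larger than the number the planner compared.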
autoBroadcastJoinThreshold compared to plan.statistics not a table size
-----------------------------------------------------------------------

Key: SPARK-46516
URL: https://issues.apache.org/jira/browse/SPARK-46516
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.1
Reporter: Guram Savinov
Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.10#820010)