Guram Savinov created SPARK-46516:
-------------------------------------

             Summary: autoBroadcastJoinThreshold compared to plan.statistics, not a table size
                 Key: SPARK-46516
                 URL: https://issues.apache.org/jira/browse/SPARK-46516
             Project: Spark
          Issue Type: Documentation
          Components: SQL
    Affects Versions: 3.1.1
            Reporter: Guram Savinov
From the docs, spark.sql.autoBroadcastJoinThreshold "Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join."

In fact, Spark compares plan.statistics.sizeInBytes (the estimated output size of the plan, i.e. only the columns selected in the join), not the size of the table itself. The broadcast table can therefore still be huge and lead to OOM on the driver, so the spark.sql.autoBroadcastJoinThreshold parameter seems useless when it is not compared against the actual broadcast table size.

https://spark.apache.org/docs/3.5.0/configuration.html#runtime-sql-configuration

Related topic on SO: https://stackoverflow.com/questions/74435020/how-dataframe-count-selects-broadcasthashjoin-while-dataframe-show-selects-s

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
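A minimal spark-shell sketch of the behavior described above (assumes an active SparkSession `spark` and a registered table `t` with a column `join_key`; these names are illustrative, not from the issue):

```scala
// Spark compares this optimizer estimate against
// spark.sql.autoBroadcastJoinThreshold, not the table's size on disk.
// Projecting a narrow column set shrinks the estimate, so a physically
// large table can still be selected for broadcast.
val projected = spark.table("t").select("join_key")
println(projected.queryExecution.optimizedPlan.stats.sizeInBytes)

// Workaround when the estimate is misleading: disable automatic
// broadcast joins and use broadcast() hints explicitly where the
// real size is known to be small.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```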