nchammas commented on code in PR #45374:
URL: https://github.com/apache/spark/pull/45374#discussion_r1512030136


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -582,11 +582,7 @@ object SQLConf {
 
   val AUTO_BROADCASTJOIN_THRESHOLD = buildConf("spark.sql.autoBroadcastJoinThreshold")
     .doc("Configures the maximum size in bytes for a table that will be broadcast to all worker " +
-      "nodes when performing a join.  By setting this value to -1 broadcasting can be disabled. " +
-      "Note that currently statistics are only supported for Hive Metastore tables where the " +

Review Comment:
   Fair question. I removed it because I don't think it explains anything.
   
   Across all of Spark, statistics come from one of the three sources I described in this PR: the data source, the catalog, and the runtime. And this applies to all cost-based optimizations, not just to auto-broadcast. Isn't that so?
   
   So I thought it would be better to remove this comment since it indirectly 
suggests that there is something special about auto-broadcast and statistics, 
when that isn't the case.
   
   But I confess I am concluding this from a high-level understanding of the optimizer. I didn't dig into the details of this particular optimization to see whether there is anything really special about it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

