I’m interested in improving some of the documentation relating to the table and column statistics that get stored in the metastore, and how Spark uses them.
But I’m not clear on a few things, so I’m writing to you with some questions.

1. The documentation for spark.sql.autoBroadcastJoinThreshold <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies that it depends on table statistics to work, but it’s not clear. Is it accurate to say that unless you have run ANALYZE on the tables participating in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution plan?

2. As a follow-on to the above question, the adaptive version of autoBroadcastJoinThreshold, namely spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it depends only on runtime statistics and not statistics in the metastore. Is that correct? I am assuming that “runtime statistics” are gathered on the fly by Spark, but I would like to mention this in the docs briefly somewhere.

3. The documentation for spark.sql.inMemoryColumnarStorage.compressed <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions “statistics”, but it’s not clear what kind of statistics we’re talking about. Are those runtime statistics, metastore statistics (that depend on you running ANALYZE), or something else?

4. The documentation for ANALYZE TABLE <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> states that the collected statistics help the optimizer “find a better query execution plan”. I wish we could link to something from here with more explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only place where metastore statistics are explicitly referenced as impacting the execution plan. Surely there must be other places, no? Would it be appropriate to mention the cost-based optimizer framework <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t appear to have any public documentation outside of Jira.

Any pointers or information you can provide would be very helpful.
Again, I am interested in contributing some documentation improvements relating to statistics, but there is a lot I’m not sure about.

Nick