When and how does Spark use metastore statistics?

Nicholas Chammas Tue, 05 Dec 2023 09:12:44 -0800

I’m interested in improving some of the documentation relating to the table and 
column statistics that get stored in the metastore, and how Spark uses them.


But I’m not clear on a few things, so I’m writing to you with some questions.

1. The documentation for spark.sql.autoBroadcastJoinThreshold 
<https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies that 
it depends on table statistics to work, but it’s not clear. Is it accurate to 
say that unless you have run ANALYZE on the tables participating in a join, 
spark.sql.autoBroadcastJoinThreshold cannot impact the execution plan?

2. As a follow-on to the above question, the adaptive version of 
autoBroadcastJoinThreshold, namely 
spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
depends only on runtime statistics and not statistics in the metastore. Is that 
correct? I am assuming that “runtime statistics” are gathered on the fly by 
Spark, but I would like to mention this in the docs briefly somewhere.

3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
<https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions 
“statistics”, but it’s not clear what kind of statistics we’re talking about. 
Are those runtime statistics, metastore statistics (that depend on you running 
ANALYZE), or something else?

4. The documentation for ANALYZE TABLE 
<https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> 
states that the collected statistics help the optimizer "find a better query 
execution plan”. I wish we could link to something from here with more 
explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only place 
where metastore statistics are explicitly referenced as impacting the execution 
plan. Surely there must be other places, no? Would it be appropriate to mention 
the cost-based optimizer framework 
<https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t appear 
to have any public documentation outside of Jira.

Any pointers or information you can provide would be very helpful. Again, I am 
interested in contributing some documentation improvements relating to 
statistics, but there is a lot I’m not sure about.

Nick

When and how does Spark use metastore statistics?

Reply via email to