I’ve done some reading and have a slightly better understanding of statistics 
now.

Every implementation of LeafNode.computeStats 
<https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210>
offers its own way to get statistics:

- LocalRelation 
<https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97>
estimates the size of the relation directly from the row count.
- HiveTableRelation 
<https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929>
pulls its statistics from the catalog or metastore.
- DataSourceV2Relation 
<https://github.com/apache/spark/blob/5fec76dc8db2499b0a9d76231f9a250871d59658/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L100>
delegates the job of computing statistics to the underlying data source.
There are a lot of details I’m still fuzzy on, but I think that’s the gist of 
things.
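
For example, here is one way to see those estimates from spark-shell. This is 
just a sketch: queryExecution, optimizedPlan, and stats are Catalyst developer 
APIs, not a stable public interface.

    // A small in-memory DataFrame becomes a LocalRelation, so its size
    // estimate is (estimated size per row) * (row count).
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id")
    println(df.queryExecution.optimizedPlan.stats.sizeInBytes)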

Would it make sense to add a paragraph or two to the SQL performance tuning 
page <https://spark.apache.org/docs/latest/sql-performance-tuning.html> 
covering statistics at a high level? Something that briefly explains:

- what statistics are and how Spark uses them to optimize plans
- the various ways Spark computes or loads statistics (catalog, data source, 
  runtime, etc.)
- how to gather catalog statistics (i.e. a pointer to ANALYZE TABLE)
- how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as part of 
  an optimized plan (i.e. .explain(mode="cost"))
- what the cost-based optimizer does and how to enable it
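
For the hands-on items, I'm picturing a short spark-shell snippet along these 
lines (the table and column names here are made up for illustration):

    // Enable the cost-based optimizer (off by default).
    spark.conf.set("spark.sql.cbo.enabled", "true")

    // Gather table-level and column-level statistics into the catalog.
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id")

    // Check the statistics stored for the table.
    spark.sql("DESCRIBE EXTENDED sales").show(truncate = false)

    // See the statistics the optimizer attaches to each operator.
    spark.table("sales").explain("cost")
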
Would this be a welcome addition to the project’s documentation? I’m happy to 
work on this.



> On Dec 5, 2023, at 12:12 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> I’m interested in improving some of the documentation relating to the table 
> and column statistics that get stored in the metastore, and how Spark uses 
> them.
> 
> But I’m not clear on a few things, so I’m writing to you with some questions.
> 
> 1. The documentation for spark.sql.autoBroadcastJoinThreshold 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies 
> that it depends on table statistics to work, but it’s not clear. Is it 
> accurate to say that unless you have run ANALYZE on the tables participating 
> in a join, spark.sql.autoBroadcastJoinThreshold cannot impact the execution 
> plan?
> 
> 2. As a follow-on to the above question, the adaptive version of 
> autoBroadcastJoinThreshold, namely 
> spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because it 
> depends only on runtime statistics and not statistics in the metastore. Is 
> that correct? I am assuming that “runtime statistics” are gathered on the fly 
> by Spark, but I would like to mention this in the docs briefly somewhere.
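> 
> For instance (both config names exist today; the values shown are just 
> examples):
> 
>     // Plan-time threshold, compared against Catalyst's size estimates;
>     // set to -1 to disable broadcast joins chosen this way.
>     spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
> 
>     // Adaptive threshold, applied by AQE using runtime statistics.
>     spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "10MB")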
> 
> 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed 
> <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions 
> “statistics”, but it’s not clear what kind of statistics we’re talking about. 
> Are those runtime statistics, metastore statistics (that depend on you 
> running ANALYZE), or something else?
> 
> 4. The documentation for ANALYZE TABLE 
> <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> 
> states that the collected statistics help the optimizer “find a better query 
> execution plan”. I wish we could link to something from here with more 
> explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only 
> place where metastore statistics are explicitly referenced as impacting the 
> execution plan. Surely there must be other places, no? Would it be 
> appropriate to mention the cost-based optimizer framework 
> <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t 
> appear to have any public documentation outside of Jira.
> 
> Any pointers or information you can provide would be very helpful. Again, I 
> am interested in contributing some documentation improvements relating to 
> statistics, but there is a lot I’m not sure about.
> 
> Nick
> 
