Some of these concepts, such as CBO and RBO, have been around outside of Spark for years, but I concur that they have a place in Spark's documentation.
Simply put, statistics provide insight into the characteristics of data, such as distribution, skew, and cardinality, which helps the optimizer make informed decisions about data partitioning, aggregation strategies, and join ordering. Spark uses statistics to:

- Partition data effectively: Spark splits data into smaller chunks to distribute and parallelize computation across worker nodes. Accurate statistics let the optimizer choose an appropriate partitioning strategy for each dataset, accounting for factors like data distribution and skew.

- Optimize join operations: Spark uses statistics to determine an efficient join order based on the join inputs and their respective cardinalities. This reduces the amount of data shuffled during joins, improving performance and minimizing data transfer overhead.

- Choose optimal aggregation strategies: When performing aggregations, Spark uses statistics to pick an efficient aggregation algorithm based on the data distribution and the desired aggregation functions, so that aggregations run efficiently without compromising accuracy.

With regard to the types of statistics:

- Catalog statistics: pre-computed statistics stored in the Spark SQL catalog and associated with table or dataset metadata. They are typically gathered with the ANALYZE TABLE statement or through data-source-specific mechanisms.

- Data source statistics: statistics computed by the data source itself, such as Parquet or Hive, and associated with the internal format of the data. Spark can access and use these statistics when working with external data sources.

- Runtime statistics: statistics computed dynamically during query execution. Spark can gather runtime statistics for certain operations, such as aggregations or joins, to refine its optimization decisions based on the data actually encountered.
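To make the join point concrete, here is a toy sketch in plain Python of how size statistics can drive a broadcast-versus-shuffle decision. This mimics the spirit of spark.sql.autoBroadcastJoinThreshold (whose default is 10 MB), not Spark's actual planner code; the table names and sizes are made up for illustration:

```python
# Toy illustration of how size statistics can drive a join strategy choice.
# Without accurate size estimates, an optimizer cannot safely make this call.

BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # mirrors Spark's 10 MB default

def choose_join_strategy(left_size_bytes: int, right_size_bytes: int) -> str:
    """Pick a join strategy from estimated input sizes (e.g. catalog statistics)."""
    smaller = min(left_size_bytes, right_size_bytes)
    if smaller <= BROADCAST_THRESHOLD_BYTES:
        # The smaller side fits under the threshold: ship it to every executor
        # and avoid shuffling the large side entirely.
        return "broadcast-hash-join"
    # Otherwise both sides must be shuffled and sorted on the join key.
    return "sort-merge-join"

# Hypothetical statistics: a large fact table joined to a small dimension table.
fact_table_bytes = 50 * 1024**3  # ~50 GB
dim_table_bytes = 2 * 1024**2    # ~2 MB

print(choose_join_strategy(fact_table_bytes, dim_table_bytes))  # broadcast-hash-join
```

The point is that the quality of the decision is only as good as the size estimates feeding it, which is why gathering statistics matters.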
It is important to mention cost-based optimization (CBO). The CBO in Spark analyzes the query plan, estimates the execution cost associated with each operation, and uses statistics to guide its decisions, selecting the plan with the lowest estimated cost. I do not know of any RDBMS that still uses a purely rule-based optimizer (RBO). Note that the CBO is disabled by default in Spark; you can control it with the following options:

- spark.sql.cbo.enabled: set to true to enable the CBO, or false to disable it (default false).
- spark.sql.cbo.joinReorder.enabled: set to true to let the CBO reorder joins based on statistics (default false; only takes effect when spark.sql.cbo.enabled is true).

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 11 Dec 2023 at 02:36, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > I’ve done some reading and have a slightly better understanding of > statistics now. 
> > Every implementation of LeafNode.computeStats > <https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210> > offers > its own way to get statistics: > > - LocalRelation > > <https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97> > estimates > the size of the relation directly from the row count. > - HiveTableRelation > > <https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929> > pulls > those statistics from the catalog or metastore. > - DataSourceV2Relation > > <https://github.com/apache/spark/blob/5fec76dc8db2499b0a9d76231f9a250871d59658/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L100> > delegates > the job of computing statistics to the underlying data source. > > There are a lot of details I’m still fuzzy on, but I think that’s the gist > of things. > > Would it make sense to add a paragraph or two to the SQL performance > tuning page > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> covering > statistics at a high level? Something that briefly explains: > > - what statistics are and how Spark uses them to optimize plans > - the various ways Spark computes or loads statistics (catalog, data > source, runtime, etc.) > - how to gather catalog statistics (i.e. pointer to ANALYZE TABLE) > - how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as > part of an optimized plan (i.e. .explain(mode="cost")) > - what the cost-based optimizer does and how to enable it > > Would this be a welcome addition to the project’s documentation? I’m happy > to work on this. 
> > > On Dec 5, 2023, at 12:12 PM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > I’m interested in improving some of the documentation relating to the > table and column statistics that get stored in the metastore, and how Spark > uses them. > > But I’m not clear on a few things, so I’m writing to you with some > questions. > > 1. The documentation for spark.sql.autoBroadcastJoinThreshold > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies > that it depends on table statistics to work, but it’s not clear. Is it > accurate to say that unless you have run ANALYZE on the tables > participating in a join, spark.sql.autoBroadcastJoinThreshold cannot impact > the execution plan? > > 2. As a follow-on to the above question, the adaptive version of > autoBroadcastJoinThreshold, namely > spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because > it depends only on runtime statistics and not statistics in the metastore. > Is that correct? I am assuming that “runtime statistics” are gathered on > the fly by Spark, but I would like to mention this in the docs briefly > somewhere. > > 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions > “statistics”, but it’s not clear what kind of statistics we’re talking > about. Are those runtime statistics, metastore statistics (that depend on > you running ANALYZE), or something else? > > 4. The documentation for ANALYZE TABLE > <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> > states > that the collected statistics help the optimizer "find a better query > execution plan”. I wish we could link to something from here with more > explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only > place where metastore statistics are explicitly referenced as impacting the > execution plan. Surely there must be other places, no? 
Would it be > appropriate to mention the cost-based optimizer framework > <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t > appear to have any public documentation outside of Jira. > > Any pointers or information you can provide would be very helpful. Again, > I am interested in contributing some documentation improvements relating to > statistics, but there is a lot I’m not sure about. > > Nick > > >
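To make the proposed documentation additions concrete, gathering and inspecting catalog statistics looks roughly like the following Spark SQL fragment (the table and column names are hypothetical; this is a sketch, not copied from the docs):

```
-- Gather catalog statistics into the metastore
ANALYZE TABLE sales COMPUTE STATISTICS;                  -- table-level: size in bytes, row count
ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS;  -- column-level: NDV, min/max, null count

-- Inspect the gathered statistics
DESCRIBE EXTENDED sales;               -- look for the "Statistics" row
DESCRIBE EXTENDED sales sales_amount;  -- per-column statistics

-- Turn on the cost-based optimizer so the statistics influence planning
-- (both flags default to false)
SET spark.sql.cbo.enabled=true;
SET spark.sql.cbo.joinReorder.enabled=true;
```

From the DataFrame API, df.explain(mode="cost") then shows the optimized plan annotated with its estimated statistics, which is a handy way to confirm the stats are actually being picked up.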