Some of these concepts, such as CBO and RBO, have been around outside of Spark for years, but I concur that they have a place in Spark's documentation.
Simply put, statistics provide insight into the characteristics of data, such as distribution, skew, and cardinality, which helps the optimizer make informed decisions about data partitioning, aggregation strategies, and join ordering. Spark uses statistics to:

- Partition data effectively: Spark splits data into smaller chunks to distribute and parallelize computation across worker nodes. Accurate statistics let the optimizer choose an appropriate partitioning strategy for each dataset, accounting for factors like data distribution and skew.

- Optimize join operations: Spark uses statistics to determine an efficient join order based on the join inputs and their respective cardinalities. This reduces the amount of data shuffled during joins, improving performance and minimizing data transfer overhead.

- Choose optimal aggregation strategies: When performing aggregations, Spark uses statistics to pick an efficient aggregation algorithm based on the data distribution and the desired aggregation functions, so that aggregations run efficiently without compromising accuracy.

With regard to the types of statistics:

- Catalog statistics: pre-computed statistics stored in the Spark SQL catalog and associated with table or dataset metadata. They are typically gathered with the ANALYZE TABLE statement or through data-source-specific mechanisms.

- Data source statistics: statistics computed by the data source itself, such as Parquet or Hive, and associated with the internal format of the data. Spark can access and use these statistics when working with external data sources.

- Runtime statistics: statistics computed dynamically during query execution. Spark can gather runtime statistics for certain operations, such as aggregations or joins, to refine its optimization decisions based on the data actually encountered.
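To make the join point concrete, here is a toy sketch in plain Python of how size statistics can drive a broadcast-versus-shuffle decision. This mimics the spirit of spark.sql.autoBroadcastJoinThreshold (whose default is 10 MB), not Spark's actual planner code; the table names and sizes are made up for illustration:

```python
# Toy illustration of how size statistics can drive a join strategy choice.
# Without accurate size estimates, an optimizer cannot safely make this call.

BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # mirrors Spark's 10 MB default

def choose_join_strategy(left_size_bytes: int, right_size_bytes: int) -> str:
    """Pick a join strategy from estimated input sizes (e.g. catalog statistics)."""
    smaller = min(left_size_bytes, right_size_bytes)
    if smaller <= BROADCAST_THRESHOLD_BYTES:
        # The smaller side fits under the threshold: ship it to every executor
        # and avoid shuffling the large side entirely.
        return "broadcast-hash-join"
    # Otherwise both sides must be shuffled and sorted on the join key.
    return "sort-merge-join"

# Hypothetical statistics: a large fact table joined to a small dimension table.
fact_table_bytes = 50 * 1024**3  # ~50 GB
dim_table_bytes = 2 * 1024**2    # ~2 MB

print(choose_join_strategy(fact_table_bytes, dim_table_bytes))  # broadcast-hash-join
```

The point is that the quality of the decision is only as good as the size estimates feeding it, which is why gathering statistics matters.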
It is important to mention cost-based optimization (CBO). The CBO in Spark analyzes the query plan, estimates the execution cost associated with each operation, and uses statistics to guide its decisions, selecting the plan with the lowest estimated cost. I do not know of any RDBMS that still uses a purely rule-based optimizer (RBO). Note that the CBO is disabled by default in Spark; you can control it with the following options:

- spark.sql.cbo.enabled: set to true to enable the CBO, or false to disable it (default false).
- spark.sql.cbo.joinReorder.enabled: set to true to let the CBO reorder joins based on statistics (default false; only takes effect when spark.sql.cbo.enabled is true).

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Mon, 11 Dec 2023 at 02:36, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > I’ve done some reading and have a slightly better understanding of > statistics now. 
> > Every implementation of LeafNode.computeStats > <https://github.com/apache/spark/blob/7cea52c96f5be1bc565a033bfd77370ab5527a35/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L210> > offers > its own way to get statistics: > > - LocalRelation > > <https://github.com/apache/spark/blob/8ff6b7a04cbaef9c552789ad5550ceab760cb078/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala#L97> > estimates > the size of the relation directly from the row count. > - HiveTableRelation > > <https://github.com/apache/spark/blob/8e95929ac4238d02dca379837ccf2fbc1cd1926d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L923-L929> > pulls > those statistics from the catalog or metastore. > - DataSourceV2Relation > > <https://github.com/apache/spark/blob/5fec76dc8db2499b0a9d76231f9a250871d59658/sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala#L100> > delegates > the job of computing statistics to the underlying data source. > > There are a lot of details I’m still fuzzy on, but I think that’s the gist > of things. > > Would it make sense to add a paragraph or two to the SQL performance > tuning page > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> covering > statistics at a high level? Something that briefly explains: > > - what statistics are and how Spark uses them to optimize plans > - the various ways Spark computes or loads statistics (catalog, data > source, runtime, etc.) > - how to gather catalog statistics (i.e. pointer to ANALYZE TABLE) > - how to check statistics on an object (i.e. DESCRIBE EXTENDED) and as > part of an optimized plan (i.e. .explain(mode="cost")) > - what the cost-based optimizer does and how to enable it > > Would this be a welcome addition to the project’s documentation? I’m happy > to work on this. 
> > > On Dec 5, 2023, at 12:12 PM, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > > I’m interested in improving some of the documentation relating to the > table and column statistics that get stored in the metastore, and how Spark > uses them. > > But I’m not clear on a few things, so I’m writing to you with some > questions. > > 1. The documentation for spark.sql.autoBroadcastJoinThreshold > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> implies > that it depends on table statistics to work, but it’s not clear. Is it > accurate to say that unless you have run ANALYZE on the tables > participating in a join, spark.sql.autoBroadcastJoinThreshold cannot impact > the execution plan? > > 2. As a follow-on to the above question, the adaptive version of > autoBroadcastJoinThreshold, namely > spark.sql.adaptive.autoBroadcastJoinThreshold, may still kick in, because > it depends only on runtime statistics and not statistics in the metastore. > Is that correct? I am assuming that “runtime statistics” are gathered on > the fly by Spark, but I would like to mention this in the docs briefly > somewhere. > > 3. The documentation for spark.sql.inMemoryColumnarStorage.compressed > <https://spark.apache.org/docs/latest/sql-performance-tuning.html> mentions > “statistics”, but it’s not clear what kind of statistics we’re talking > about. Are those runtime statistics, metastore statistics (that depend on > you running ANALYZE), or something else? > > 4. The documentation for ANALYZE TABLE > <https://spark.apache.org/docs/latest/sql-ref-syntax-aux-analyze-table.html> > states > that the collected statistics help the optimizer "find a better query > execution plan”. I wish we could link to something from here with more > explanation. Currently, spark.sql.autoBroadcastJoinThreshold is the only > place where metastore statistics are explicitly referenced as impacting the > execution plan. Surely there must be other places, no? 
Would it be > appropriate to mention the cost-based optimizer framework > <https://issues.apache.org/jira/browse/SPARK-16026> somehow? It doesn’t > appear to have any public documentation outside of Jira. > > Any pointers or information you can provide would be very helpful. Again, > I am interested in contributing some documentation improvements relating to > statistics, but there is a lot I’m not sure about. > > Nick > > >
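To make the proposed documentation additions concrete, gathering and inspecting catalog statistics looks roughly like the following Spark SQL fragment (the table and column names are hypothetical; this is a sketch, not copied from the docs):

```
-- Gather catalog statistics into the metastore
ANALYZE TABLE sales COMPUTE STATISTICS;                  -- table-level: size in bytes, row count
ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS;  -- column-level: NDV, min/max, null count

-- Inspect the gathered statistics
DESCRIBE EXTENDED sales;               -- look for the "Statistics" row
DESCRIBE EXTENDED sales sales_amount;  -- per-column statistics

-- Turn on the cost-based optimizer so the statistics influence planning
-- (both flags default to false)
SET spark.sql.cbo.enabled=true;
SET spark.sql.cbo.joinReorder.enabled=true;
```

From the DataFrame API, df.explain(mode="cost") then shows the optimized plan annotated with its estimated statistics, which is a handy way to confirm the stats are actually being picked up.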