[ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-44817:
-----------------------------------
    Summary: SPIP: Incremental Stats Collection  (was: Incremental Stats 
Collection)

> SPIP: Incremental Stats Collection
> ----------------------------------
>
>                 Key: SPARK-44817
>                 URL: https://issues.apache.org/jira/browse/SPARK-44817
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
> Spark's cost-based optimizer (CBO) depends on table and column statistics.
> After every DML query, these stats are invalidated unless automatic stats
> updates are turned on. Keeping them current requires running the
> `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive, so it
> is not feasible to run it after every DML query.
> Instead, we can update the stats incrementally while each DML query runs.
> That way table and column stats stay fresh at all times and the CBO's
> benefits can be applied. Initially, we can update only table-level stats
> and then gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE
> STATISTICS` for updating stats.
> [SPIP Document|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
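
The table-level part of the proposal can be pictured with a small sketch. This is not Spark code and is not taken from the SPIP document: `TableStats` and `apply_dml_delta` are hypothetical names, and the sketch assumes each DML statement can report how many rows and bytes it added or removed. Under that assumption, the stored stats are adjusted by the delta instead of being recomputed by a full scan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical table-level stats; not Spark's own statistics class."""
    row_count: int
    size_in_bytes: int


def apply_dml_delta(stats: TableStats, rows_delta: int, bytes_delta: int) -> TableStats:
    """Return stats adjusted by the rows/bytes a DML statement added or removed.

    Positive deltas model an INSERT, negative deltas a DELETE. Values are
    clamped at zero so an over-estimated delete never yields negative stats.
    """
    return TableStats(
        row_count=max(0, stats.row_count + rows_delta),
        size_in_bytes=max(0, stats.size_in_bytes + bytes_delta),
    )


# Assumed starting point: stats from an earlier full ANALYZE run.
stats = TableStats(row_count=1_000, size_in_bytes=64_000)

# An INSERT that wrote 250 rows / 16,000 bytes, then a DELETE of 50 rows.
after_insert = apply_dml_delta(stats, rows_delta=250, bytes_delta=16_000)
after_delete = apply_dml_delta(after_insert, rows_delta=-50, bytes_delta=-3_200)
```

Row count and size are additive, which is why table-level stats are the easier first step. Column-level stats such as distinct counts or min/max cannot in general be maintained by simple addition and subtraction, which may be one reason the description defers them.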


