[jira] [Commented] (SPARK-44817) Incremental Stats Collection

Rakesh Raushan (Jira) Tue, 22 Aug 2023 08:40:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757539#comment-17757539
 ]


Rakesh Raushan commented on SPARK-44817:
----------------------------------------

Sure. I would try to come up with a SPIP by this weekend.

> Incremental Stats Collection
> ----------------------------
>
>                 Key: SPARK-44817
>                 URL: https://issues.apache.org/jira/browse/SPARK-44817
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
> Spark's Cost Based Optimizer is dependent on the table and column statistics.
> After every execution of DML query, table and column stats are invalidated if 
> auto update of stats collection is not turned on. To keep stats updated we 
> need to run `ANALYZE TABLE COMPUTE STATISTICS` command which is very 
> expensive. It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run 
> itself. This way our table and column stats would be fresh at all the time 
> and CBO benefits can be applied. Initially, we can only update table level 
> stats and gradually start updating column level stats as well.
> *Pros:*
> 1. Optimize queries over table which is updated frequently.
> 2. Saves Compute cycles by removing dependency over `ANALYZE TABLE COMPUTE 
> STATISTICS` for updating stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-44817) Incremental Stats Collection

Reply via email to