Rakesh Raushan created SPARK-44817:
--------------------------------------

             Summary: Incremental Stats Collection
                 Key: SPARK-44817
                 URL: https://issues.apache.org/jira/browse/SPARK-44817
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Rakesh Raushan


Spark's cost-based optimizer (CBO) depends on table and column statistics.

After a DML query runs, the table and column stats are invalidated unless
automatic stats update is turned on. Keeping the stats current then requires
running the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive.
It is not feasible to run this command after every DML query.

Instead, we can update the stats incrementally as part of each DML query
run. This way the table and column stats stay fresh at all times and
the CBO's optimizations can be applied.
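
To illustrate the idea, here is a minimal sketch in plain Python, not Spark
code. It assumes an append-only DML (an INSERT) and a single integer column.
The names `ColumnStats`, `stats_of` and `merge` are hypothetical and do not
correspond to any existing Spark class or API. The point is only that row
count, null count, min and max can be folded in from the newly written rows
without rescanning the table. Deletes, updates and distinct counts are not
covered here; they would likely need a mergeable sketch or a recompute.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-column summary; not Spark's catalog statistics classes."""
    row_count: int = 0
    null_count: int = 0
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def stats_of(values: Sequence[Optional[int]]) -> ColumnStats:
    """Compute stats for one batch of values, e.g. the rows an INSERT wrote."""
    non_null = [v for v in values if v is not None]
    return ColumnStats(
        row_count=len(values),
        null_count=len(values) - len(non_null),
        min_value=min(non_null) if non_null else None,
        max_value=max(non_null) if non_null else None,
    )


def _combine(a: Optional[int], b: Optional[int],
             pick: Callable[[int, int], int]) -> Optional[int]:
    """Apply pick (min or max) while treating None as 'no value seen yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return pick(a, b)


def merge(existing: ColumnStats, delta: ColumnStats) -> ColumnStats:
    """Fold the stats of newly appended rows into the stored stats."""
    return ColumnStats(
        row_count=existing.row_count + delta.row_count,
        null_count=existing.null_count + delta.null_count,
        min_value=_combine(existing.min_value, delta.min_value, min),
        max_value=_combine(existing.max_value, delta.max_value, max),
    )


# Usage: merging the delta gives the same result as a full recompute.
existing = stats_of([5, 9, None, 12])   # stats already stored for the table
delta = stats_of([3, None, 20])         # stats of the rows just inserted
merged = merge(existing, delta)
assert merged == stats_of([5, 9, None, 12, 3, None, 20])
```

The merge touches only the inserted rows, so its cost scales with the size of
the DML write rather than with the size of the table.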

*Pros:*

1. Optimizes queries over tables that are updated frequently.
2. Saves compute cycles by removing the dependency on
`ANALYZE TABLE COMPUTE STATISTICS` for updating stats.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

