[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rakesh Raushan updated SPARK-44817:
-----------------------------------
    Summary: SPIP: Incremental Stats Collection  (was: Incremental Stats Collection)

> SPIP: Incremental Stats Collection
> ----------------------------------
>
>                 Key: SPARK-44817
>                 URL: https://issues.apache.org/jira/browse/SPARK-44817
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
> Spark's Cost Based Optimizer (CBO) depends on table and column statistics.
> After every DML query, table and column stats are invalidated unless
> automatic stats update is turned on. To keep the stats current, we have to
> run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very expensive.
> It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run
> itself. This way, table and column stats stay fresh at all times and the
> CBO's benefits can be applied. Initially, we can update only table-level
> stats and then gradually start updating column-level stats as well.
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE
> STATISTICS` for updating stats.
> [SPIP Document|https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing]
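
The incremental update proposed in the issue can be illustrated with a small sketch. This is plain Python, not Spark code, and it is not taken from the SPIP document: the names `TableStats`, `DmlDelta`, and `apply_delta` are hypothetical, and the sketch assumes each DML statement can report how many rows and bytes it added or removed. It covers only table-level stats (row count and size), which is the first phase the issue describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Table-level statistics of the kind a cost-based optimizer reads."""
    row_count: int
    size_in_bytes: int


@dataclass(frozen=True)
class DmlDelta:
    """Net change reported by one DML statement (hypothetical shape)."""
    rows_added: int = 0
    rows_removed: int = 0
    bytes_added: int = 0
    bytes_removed: int = 0


def apply_delta(stats: TableStats, delta: DmlDelta) -> TableStats:
    """Fold a DML delta into the existing stats without rescanning the table."""
    return TableStats(
        # Clamp at zero so an over-reported removal never yields negative stats.
        row_count=max(0, stats.row_count + delta.rows_added - delta.rows_removed),
        size_in_bytes=max(
            0, stats.size_in_bytes + delta.bytes_added - delta.bytes_removed
        ),
    )


# Stats as they might look after a one-time full ANALYZE.
stats = TableStats(row_count=1_000, size_in_bytes=64_000)

# An INSERT of 250 rows, then a DELETE of 100 rows, each folded in as it runs.
stats = apply_delta(stats, DmlDelta(rows_added=250, bytes_added=16_000))
stats = apply_delta(stats, DmlDelta(rows_removed=100, bytes_removed=6_400))

print(stats)
```

Each update costs constant time per statement, whereas a full recomputation has to scan the table. Column-level stats such as distinct counts or min/max are harder to maintain this way (a delete can invalidate a minimum, for example), which fits the issue's suggestion to start with table-level stats.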