[ https://issues.apache.org/jira/browse/SPARK-44817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757539#comment-17757539 ]
Rakesh Raushan commented on SPARK-44817:
----------------------------------------

Sure. I will try to come up with a SPIP by this weekend.

> Incremental Stats Collection
> ----------------------------
>
>                 Key: SPARK-44817
>                 URL: https://issues.apache.org/jira/browse/SPARK-44817
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Rakesh Raushan
>            Priority: Major
>
> Spark's Cost Based Optimizer (CBO) depends on table and column statistics.
> After every DML query, table and column stats are invalidated unless
> automatic stats updating is turned on. To keep stats up to date, we
> need to run the `ANALYZE TABLE COMPUTE STATISTICS` command, which is very
> expensive. It is not feasible to run this command after every DML query.
> Instead, we can incrementally update the stats during each DML query run
> itself. This way our table and column stats would stay fresh at all times
> and the CBO benefits can be applied. Initially, we can update only
> table-level stats and gradually start updating column-level stats as well.
>
> *Pros:*
> 1. Optimizes queries over tables that are updated frequently.
> 2. Saves compute cycles by removing the dependency on `ANALYZE TABLE COMPUTE
> STATISTICS` for updating stats.
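
For illustration only, the incremental update proposed in the issue could look roughly like the sketch below: instead of rescanning the table, the stats of the rows written by a DML statement are merged into the stats already stored. This is not Spark code. `TableStats`, `ColumnStats`, `merge_table_stats` and `merge_column_stats` are hypothetical names, and the sketch assumes an append-only write (INSERT) with integer column values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableStats:
    """Hypothetical table-level summary (not a Spark class)."""
    row_count: int
    size_in_bytes: int


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical column-level summary (not a Spark class)."""
    min_value: Optional[int]  # None when the column holds no non-null values
    max_value: Optional[int]
    null_count: int


def merge_table_stats(current: TableStats, delta: TableStats) -> TableStats:
    """Fold the stats of newly inserted rows into the stored table stats."""
    return TableStats(
        row_count=current.row_count + delta.row_count,
        size_in_bytes=current.size_in_bytes + delta.size_in_bytes,
    )


def merge_column_stats(current: ColumnStats, delta: ColumnStats) -> ColumnStats:
    """Fold the column stats of newly inserted rows into the stored ones."""
    mins = [v for v in (current.min_value, delta.min_value) if v is not None]
    maxs = [v for v in (current.max_value, delta.max_value) if v is not None]
    return ColumnStats(
        min_value=min(mins) if mins else None,
        max_value=max(maxs) if maxs else None,
        null_count=current.null_count + delta.null_count,
    )


# Stored stats before the INSERT, and stats computed over the inserted batch only.
stored = TableStats(row_count=1000, size_in_bytes=64000)
batch = TableStats(row_count=50, size_in_bytes=3200)
updated = merge_table_stats(stored, batch)
```

Row count, size, null count, min and max can all be merged this way for appends. Deletes and updates are harder, because min/max and distinct counts cannot in general be corrected from the changed rows alone, which fits the issue's suggestion to start with table-level stats only.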