Hi all,

I would like to propose incremental collection of statistics in Spark. SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been filed to track this.

Currently, Spark invalidates table statistics after data-changing commands, which leaves the cost-based optimizer (CBO) without usable stats. To refresh them, the user must either run the `ANALYZE TABLE` command or enable `spark.sql.statistics.size.autoUpdate.enabled`. Both options have drawbacks: `ANALYZE TABLE` triggers a full table scan, while the auto-update setting only updates table and partition stats and can be costly in certain cases.

The goal of this proposal is to collect stats incrementally while executing data-changing commands, by utilizing the framework introduced in SPARK-21669 <https://issues.apache.org/jira/browse/SPARK-21669>.

The SPIP document is attached to the JIRA:
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

Hive also supports automatic collection of statistics to keep the stats consistent. There are multiple Spark JIRAs asking for the same:
https://issues.apache.org/jira/browse/SPARK-28872
https://issues.apache.org/jira/browse/SPARK-33825

Regards,
Rakesh