Hi all,

I would like to propose the incremental collection of statistics in spark.
SPARK-44817 <https://issues.apache.org/jira/browse/SPARK-44817> has been
raised for the same.

Currently, spark invalidates the stats after data changing commands which
would make CBO non-functional. To update these stats, user either needs to
run `ANALYZE TABLE` command or turn
`spark.sql.statistics.size.autoUpdate.enabled`. Both of these ways have
their own drawbacks, executing `ANALYZE TABLE` command triggers full table
scan while the other one only updates table and partition stats and can be
costly in certain cases.

The goal of this proposal is to collect stats incrementally while executing
data changing commands by utilizing the framework introduced in SPARK-21669
<https://issues.apache.org/jira/browse/SPARK-21669>.

SPIP Document has been attached along with JIRA:
https://docs.google.com/document/d/1CNPWg_L1fxfB4d2m6xfizRyYRoWS2uPCwTKzhL2fwaQ/edit?usp=sharing

Hive also supports automatic collection of statistics to keep the stats
consistent.
I can find multiple spark JIRAs asking for the same:
https://issues.apache.org/jira/browse/SPARK-28872
https://issues.apache.org/jira/browse/SPARK-33825

Regards,
Rakesh

Reply via email to