[ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501864#comment-17501864 ]
gabrywu commented on SPARK-38258:
---------------------------------

[~yumwang] what do you think of it?

> [proposal] collect & update statistics automatically when Spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL
> optimizer; however, we currently have to collect & update them manually using
> {code:java}
> ANALYZE TABLE tableName COMPUTE STATISTICS{code}
> This is a little inconvenient, so why can't we collect & update statistics
> automatically when a Spark stage runs and finishes?
> For example, when an INSERT OVERWRITE TABLE statement finishes, we can update
> the corresponding table statistics from its SQL metrics, and subsequent
> queries can then use those statistics in the optimizer.
> It is a common pattern to run daily batches with Spark SQL, so the same SQL
> runs every day while its corresponding tables' data changes slowly. That
> means we can use statistics collected yesterday to optimize today's queries,
> and also to adjust important configs such as spark.sql.shuffle.partitions.
> So we should add a mechanism to store every stage's statistics somewhere and
> reuse them in new SQLs, not just collect statistics after a stage finishes.
> Of course, we should also add a version number to the statistics in case they
> become stale.

-- This message was sent by Atlassian Jira (v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
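The versioned statistics store proposed in the ticket could look roughly like the following pure-Python sketch. All names here (`StatsStore`, `TableStats`, `update_from_metrics`) are hypothetical, not part of any Spark API: the point is only that each update from a finished stage's SQL metrics bumps a version number, and lookups ignore entries older than a freshness threshold.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TableStats:
    # Hypothetical subset of the statistics Spark tracks per table.
    row_count: int
    size_in_bytes: int
    version: int        # incremented on every update, as the ticket suggests
    updated_at: float   # unix timestamp of the last update


class StatsStore:
    """Sketch of a versioned statistics store (illustrative only)."""

    def __init__(self, max_age_seconds: float = 86400.0):
        self._stats: Dict[str, TableStats] = {}
        self._max_age = max_age_seconds

    def update_from_metrics(self, table: str, row_count: int, size_in_bytes: int) -> None:
        # Would be called when a stage (e.g. an INSERT OVERWRITE) finishes,
        # feeding in the row count / bytes written reported by its SQL metrics.
        prev = self._stats.get(table)
        version = prev.version + 1 if prev else 1
        self._stats[table] = TableStats(row_count, size_in_bytes, version, time.time())

    def lookup(self, table: str) -> Optional[TableStats]:
        # Return statistics only if they are fresh enough to trust;
        # stale or missing entries yield None, so the optimizer falls back
        # to its default estimates.
        stats = self._stats.get(table)
        if stats is None or time.time() - stats.updated_at > self._max_age:
            return None
        return stats
```

For example, two successive writes to the same table would leave it at version 2 with the latest metrics, while a table never written through the store would report no statistics at all.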