[ https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501864#comment-17501864 ]
gabrywu commented on SPARK-38258:
---------------------------------

[~yumwang] what do you think of it?

> [proposal] collect & update statistics automatically when Spark SQL is running
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-38258
>                 URL: https://issues.apache.org/jira/browse/SPARK-38258
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core, SQL
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>            Reporter: gabrywu
>            Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL
> optimizer; however, we currently have to collect & update them manually using
> {code:java}
> ANALYZE TABLE tableName COMPUTE STATISTICS{code}
> This is a little inconvenient, so why can't we collect & update statistics
> automatically when a Spark stage runs and finishes?
> For example, when an INSERT OVERWRITE TABLE statement finishes, we can update
> the corresponding table statistics from its SQL metrics, and subsequent
> queries can then use those statistics in the optimizer.
> It is a common pattern to run daily batches with Spark SQL, so the same SQL
> runs every day while its corresponding tables' data changes slowly. That
> means we can use statistics collected yesterday to optimize today's queries,
> and also to adjust important configs such as spark.sql.shuffle.partitions.
> So we should add a mechanism to store every stage's statistics somewhere and
> reuse them in new SQLs, not just collect statistics after a stage finishes.
> Of course, we should also add a version number to the statistics in case they
> become stale.

-- This message was sent by Atlassian Jira (v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
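The versioned statistics store proposed in the ticket could look roughly like the following pure-Python sketch. All names here (`StatsStore`, `TableStats`, `update_from_metrics`) are hypothetical, not part of any Spark API: the point is only that each update from a finished stage's SQL metrics bumps a version number, and lookups ignore entries older than a freshness threshold.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TableStats:
    # Hypothetical subset of the statistics Spark tracks per table.
    row_count: int
    size_in_bytes: int
    version: int        # incremented on every update, as the ticket suggests
    updated_at: float   # unix timestamp of the last update


class StatsStore:
    """Sketch of a versioned statistics store (illustrative only)."""

    def __init__(self, max_age_seconds: float = 86400.0):
        self._stats: Dict[str, TableStats] = {}
        self._max_age = max_age_seconds

    def update_from_metrics(self, table: str, row_count: int, size_in_bytes: int) -> None:
        # Would be called when a stage (e.g. an INSERT OVERWRITE) finishes,
        # feeding in the row count / bytes written reported by its SQL metrics.
        prev = self._stats.get(table)
        version = prev.version + 1 if prev else 1
        self._stats[table] = TableStats(row_count, size_in_bytes, version, time.time())

    def lookup(self, table: str) -> Optional[TableStats]:
        # Return statistics only if they are fresh enough to trust;
        # stale or missing entries yield None, so the optimizer falls back
        # to its default estimates.
        stats = self._stats.get(table)
        if stats is None or time.time() - stats.updated_at > self._max_age:
            return None
        return stats
```

For example, two successive writes to the same table would leave it at version 2 with the latest metrics, while a table never written through the store would report no statistics at all.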