[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?
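
For reference, this is today's manual flow as a minimal sketch (the SparkSession 
{{spark}}, the table name, and the DESC EXTENDED inspection step are illustrative 
assumptions; FOR ALL COLUMNS needs a recent Spark version):
{code:scala}
// Minimal sketch of the current manual statistics flow; db.events is a placeholder.
spark.sql("ANALYZE TABLE db.events COMPUTE STATISTICS")                 // table-level stats
spark.sql("ANALYZE TABLE db.events COMPUTE STATISTICS FOR ALL COLUMNS") // column-level stats
// The collected rowCount/sizeInBytes and column statistics can be inspected with:
spark.sql("DESC EXTENDED db.events").show(100, truncate = false)
{code}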

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.
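
To make this concrete, here is a rough, hypothetical sketch of such a hook built 
on a QueryExecutionListener. It is not the proposed implementation: the target 
table name, the metric keys (numOutputRows, numOutputBytes) and the property keys 
are assumptions, and a real mechanism would resolve the written table from the 
plan and store the statistics in the catalog rather than in table properties.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical listener: after a write finishes, read the SQL metrics of the
// executed plan and record them as versioned table properties.
class StatsCollectingListener(spark: SparkSession) extends QueryExecutionListener {

  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    val metrics = qe.executedPlan.metrics                     // Map[String, SQLMetric]
    val rows    = metrics.get("numOutputRows").map(_.value)   // metric names are assumptions
    val bytes   = metrics.get("numOutputBytes").map(_.value)
    for (r <- rows; b <- bytes) {
      // db.events is a placeholder for the table the finished statement wrote to.
      spark.sql(
        s"""ALTER TABLE db.events SET TBLPROPERTIES (
           |  'auto.stats.rowCount'    = '$r',
           |  'auto.stats.sizeInBytes' = '$b',
           |  'auto.stats.version'     = '${System.currentTimeMillis()}')""".stripMargin)
    }
  }

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

// Registration:
// spark.listenerManager.register(new StatsCollectingListener(spark))
{code}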

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's 
SQL, and we can also use them to adjust important configs such as 
spark.sql.shuffle.partitions.
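
As a hedged illustration of that last point (the property key and the 128 MB 
per-partition target below are assumptions carried over from the listener sketch 
above), a recurring job could size its shuffles from yesterday's recorded numbers:
{code:scala}
// Sketch: derive spark.sql.shuffle.partitions for today's run of a recurring job
// from the size statistic recorded by yesterday's run.
val props = spark.sql("SHOW TBLPROPERTIES db.events").collect()
  .map(row => row.getString(0) -> row.getString(1)).toMap
val sizeInBytes = props.get("auto.stats.sizeInBytes").map(_.toLong).getOrElse(0L)

val targetPartitionBytes = 128L * 1024 * 1024   // assumed target size per shuffle partition
val partitions = math.max(1, (sizeInBytes / targetPartitionBytes).toInt)
spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)
{code}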

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.
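
A possible shape for such versioned statistics, purely illustrative (field names 
and the freshness rule are assumptions):
{code:scala}
// Hypothetical record for statistics persisted with a version, so a later run can
// decide whether the numbers are still trustworthy before relying on them.
case class VersionedTableStats(
    table: String,
    rowCount: Long,
    sizeInBytes: Long,
    version: Long)   // e.g. epoch millis of the run that produced the statistics

def isFresh(stats: VersionedTableStats, maxAgeMs: Long): Boolean =
  System.currentTimeMillis() - stats.version <= maxAgeMs
{code}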

 

https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's 
SQL, and we can also use them to adjust important configs such as 
spark.sql.shuffle.partitions.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.


> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL 
> optimizer; however, we currently have to collect & update them manually with 
> {code:sql}
> ANALYZE TABLE tableName COMPUTE STATISTICS{code}
> That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
> statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
> the corresponding table's statistics from the job's SQL metrics. In subsequent 
> queries, the Spark SQL optimizer can then use these statistics.
> It is also a common case that we run daily batches of Spark SQL jobs, so the 
> same SQL runs every day and the data in its corresponding tables changes 
> slowly. That means we can use the statistics collected yesterday to optimize 
> today's SQL, and we can also use them to adjust important configs such as 
> spark.sql.shuffle.partitions.
> So we should add a mechanism that stores every stage's statistics somewhere 
> and reuses them in new SQL queries, rather than just collecting statistics 
> after a stage finishes.
> Of course, we should also {color:#ff0000}add a version number to the 
> statistics{color} so that stale statistics can be detected and ignored.
>  
> https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#






[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's 
SQL, and we can also use them to adjust important configs such as 
spark.sql.shuffle.partitions.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-04 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-03-04 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we {color:#ff0000}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

Of course, we should also {color:#ff0000}add a version number to the 
statistics{color} so that stale statistics can be detected and ignored.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use the statistics collected yesterday to optimize today's SQL.

So we should add a mechanism that stores every stage's statistics somewhere and 
reuses them in new SQL queries, rather than just collecting statistics after a 
stage finishes.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use sta

 








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

It is also a common case that we run daily batches of Spark SQL jobs, so the 
same SQL runs every day and the data in its corresponding tables changes slowly. 
That means we can use sta

 

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

So what do you think of it, [~yumwang]? Is it reasonable?








[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Affects Version/s: 2.4.0

> [proposal] collect & update statistics automatically when spark SQL is running
> --
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark SQL 
> optimizer; however, we currently have to collect & update them manually with 
> {code:sql}
> ANALYZE TABLE tableName COMPUTE STATISTICS{code}
> That is a little inconvenient, so why can't we collect & update statistics 
> when a Spark stage runs and finishes?
> For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
> the corresponding table's statistics from the job's SQL metrics. In subsequent 
> queries, the Spark SQL optimizer can then use these statistics.
> So what do you think of it, [~yumwang]? Is it reasonable?






[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when spark SQL is running

2022-02-20 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

So what do you think of it, [~yumwang]? Is it reasonable?

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we currently have to collect & update them manually with 
{code:sql}
ANALYZE TABLE tableName COMPUTE STATISTICS{code}
That is a little inconvenient, so why can't we collect & update statistics when 
a Spark stage runs and finishes?

For example, when an INSERT OVERWRITE TABLE statement finishes, we can update 
the corresponding table's statistics from the job's SQL metrics. In subsequent 
queries, the Spark SQL optimizer can then use these statistics.

So what do you think of it, [~yumwang]?





--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org