[ https://issues.apache.org/jira/browse/SPARK-27631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated SPARK-27631:
--------------------------------
    Description: 
How to reproduce:
{code:java}
build/sbt clean package -Phive -Phadoop-3.2
export SPARK_PREPEND_CLASSES=true
bin/spark-shell --conf spark.hadoop.hive.metastore.schema.verification=false --conf spark.hadoop.datanucleus.schema.autoCreateAll=true --conf spark.sql.statistics.size.autoUpdate.enabled=true
{code}
{code:java}
sc.setLogLevel("INFO")
spark.sql("create table t1(id int) using hive")
spark.sql("insert into t1 values(1)")
{code}
{noformat}
19/05/03 21:38:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 468 ms on localhost (executor driver) (1/1)
19/05/03 21:38:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/05/03 21:38:53 INFO DAGScheduler: ResultStage 0 (sql at <console>:24) finished in 0.670 s
19/05/03 21:38:53 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
19/05/03 21:38:53 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
19/05/03 21:38:53 INFO DAGScheduler: Job 0 finished: sql at <console>:24, took 0.710944 s
19/05/03 21:38:53 INFO FileFormatWriter: Write Job a1db667b-ff3a-454f-a7d1-a4d79d343e6b committed.
19/05/03 21:38:53 INFO FileFormatWriter: Finished processing stats for write job a1db667b-ff3a-454f-a7d1-a4d79d343e6b.
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO SessionState: Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.
19/05/03 21:38:53 INFO SessionState: Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.
19/05/03 21:38:53 INFO HiveMetaStore: 0: alter_table: db=default tbl=t1 newtbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=alter_table: db=default tbl=t1 newtbl=t1
19/05/03 21:38:53 INFO log: Updating table stats fast for t1
19/05/03 21:38:53 INFO log: Updated size of table t1 to 2
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_database: default
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_database: default
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO CommandUtils: Starting to calculate the total file size under path Some(file:/root/opensource/spark/spark-warehouse/t1).
19/05/03 21:38:53 INFO CommandUtils: It took 3 ms to calculate the total file size under path Some(file:/root/opensource/spark/spark-warehouse/t1).
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_database: default
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : db=default tbl=t1
19/05/03 21:38:53 INFO HiveMetaStore: 0: alter_table: db=default tbl=t1 newtbl=t1
19/05/03 21:38:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=alter_table: db=default tbl=t1 newtbl=t1
19/05/03 21:38:53 INFO log: Updating table stats fast for t1
19/05/03 21:38:53 INFO log: Updated size of table t1 to 2
{noformat}
The log shows {{Updated size of table t1 to 2}} twice: a single insert causes the table statistics to be calculated and written to the metastore two times.
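To confirm what was persisted, the stored table-level statistics can be inspected from the same spark-shell session. This is a minimal verification sketch, assuming only the {{t1}} table created above:
{code:java}
// Show the table-level statistics Spark persisted for t1. After the single
// insert above, the Statistics row should report 2 bytes (matching the
// "Updated size of table t1 to 2" log lines).
spark.sql("DESCRIBE TABLE EXTENDED t1")
  .where("col_name = 'Statistics'")
  .show(truncate = false)
{code}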
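Until the duplicate update is removed, a possible workaround (a sketch, assuming per-write statistics freshness is not required) is to leave {{spark.sql.statistics.size.autoUpdate.enabled}} at its default of {{false}} and refresh the size statistics explicitly after a batch of writes:
{code:java}
// Workaround sketch: disable the automatic size update, then recompute the
// size-only statistics once via a NOSCAN analyze after the writes finish.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "false")
spark.sql("insert into t1 values(2)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS NOSCAN")
{code}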
> Avoid repeatedly calculating table statistics when AUTO_SIZE_UPDATE_ENABLED is enabled
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-27631
>                 URL: https://issues.apache.org/jira/browse/SPARK-27631
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org