[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r125154201

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -97,6 +106,10 @@ object CommandUtils extends Logging {
         0L
       }
     }.getOrElse(0L)
+    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+    logInfo(s"It took $durationInMs ms to calculate the total file size under path $locationUri.")
--- End diff --

Actually, the log message already contains the timestamp, so it does not need to calculate the total time, but I think it is fine here.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
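[Editor's note] The `durationInMs` computation quoted above is the usual `System.nanoTime()` elapsed-time pattern; a minimal standalone sketch of the same idea (names such as `calculateSize` are illustrative placeholders, not the actual Spark helper):

```scala
// Illustrative stand-in for the file-size walk being timed in the diff above.
def calculateSize(): Long = 42L

val startTime = System.nanoTime()
val totalSize = calculateSize()
// Nanoseconds -> milliseconds, the same conversion as in the diff.
val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
println(s"It took $durationInMs ms to calculate the total file size ($totalSize bytes).")
```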
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18334
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124977705

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

INFO should be fine, because it should not be a lot, right?
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124977390

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,12 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed.")
--- End diff --

Yea, we should clearly document the influence of the flag. Thanks for pointing this out.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124977456

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

OK, shall we log at info level or debug level?
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124971615

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,12 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed.")
--- End diff --

This flag could slow down all the data-changing commands, so we need to clearly explain the potential performance regression. Based on the current description, every user would be willing to turn it on. Normally, users expect us to do incremental updates instead of recalculating from scratch.
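[Editor's note] The incremental-vs-recalculate distinction raised here can be sketched as follows (a hedged illustration only; `oldStats`, `addedSize`, and the surrounding context are placeholders, not Spark's actual implementation):

```scala
// What users may expect: an incremental update -- add only the size of the
// newly written partitions, which requires no extra file-system listing.
val incrementalSize: BigInt = oldStats.sizeInBytes + addedSize

// What the flag can trigger in the worst case: a full recalculation,
// re-listing every file under the table location -- slow for large tables,
// hence the requested warning in the config doc.
val recalculatedSize: BigInt =
  CommandUtils.calculateTotalSize(sparkSession.sessionState, table)
```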
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124971431

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

In the function `calculateLocationSize`, please add log messages when starting/finishing the statistics collection and
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124469169

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
+   */
+  private def getStatsProperties(tableName: String): Map[String, String] = {
+    val hTable = hiveClient.getTable(spark.sessionState.catalog.getCurrentDatabase, tableName)
+    hTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))
+  }
+
+  test("change stats after insert command for hive table") {
+    val table = s"change_stats_insert_hive_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i int, j string)")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          // insert into command
+          sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
+          if (autoUpdate) {
+            val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+            assert(fetched2.get.sizeInBytes > 0)
+            assert(fetched2.get.colStats.isEmpty)
+            val statsProp = getStatsProperties(table)
+            assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+          } else {
+            assert(getStatsProperties(table).isEmpty)
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after load data command") {
+    val table = "change_stats_load_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i INT, j STRING) STORED AS PARQUET")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          withTempDir { loadPath =>
+            // load data command
+            val file = new File(loadPath + "/data")
+            val writer = new PrintWriter(file)
+            writer.write("2,xyz")
+            writer.close()
+            sql(s"LOAD DATA INPATH '${loadPath.toURI.toString}' INTO TABLE $table")
+            if (autoUpdate) {
+              val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+              assert(fetched2.get.sizeInBytes > 0)
+              assert(fetched2.get.colStats.isEmpty)
+              val statsProp = getStatsProperties(table)
+              assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+            } else {
+              assert(getStatsProperties(table).isEmpty)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after add/drop partition command") {
--- End diff --

We can't get table properties through `hiveClient` in the parent class.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124456568

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
--- End diff --

Hive updates stats for some formats. E.g. in the test `test statistics of LogicalRelation converted from Hive serde tables`, hive updates stats for the orc table but not the parquet table. I think it's tricky and fussy to rely on hive's stats, so we should rely on spark's stats. Here I just eliminate hive's influence from the check. Since in `CatalogStatistics` we first fill in hive's stats and then override them using spark's stats, we can't tell where the stats are from if we check `CatalogStatistics`.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124445546

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
+   */
+  private def getStatsProperties(tableName: String): Map[String, String] = {
+    val hTable = hiveClient.getTable(spark.sessionState.catalog.getCurrentDatabase, tableName)
+    hTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))
+  }
+
+  test("change stats after insert command for hive table") {
+    val table = s"change_stats_insert_hive_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i int, j string)")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          // insert into command
+          sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
+          if (autoUpdate) {
+            val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+            assert(fetched2.get.sizeInBytes > 0)
+            assert(fetched2.get.colStats.isEmpty)
+            val statsProp = getStatsProperties(table)
+            assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+          } else {
+            assert(getStatsProperties(table).isEmpty)
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after load data command") {
+    val table = "change_stats_load_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i INT, j STRING) STORED AS PARQUET")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          withTempDir { loadPath =>
+            // load data command
+            val file = new File(loadPath + "/data")
+            val writer = new PrintWriter(file)
+            writer.write("2,xyz")
+            writer.close()
+            sql(s"LOAD DATA INPATH '${loadPath.toURI.toString}' INTO TABLE $table")
+            if (autoUpdate) {
+              val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+              assert(fetched2.get.sizeInBytes > 0)
+              assert(fetched2.get.colStats.isEmpty)
+              val statsProp = getStatsProperties(table)
+              assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+            } else {
+              assert(getStatsProperties(table).isEmpty)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after add/drop partition command") {
--- End diff --

Can we put these tests in the parent class, so we don't need to duplicate the code for hive and data source tables?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124445434

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
--- End diff --

Do you mean we will have table stats anyway, since the hive metastore stores them?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124445210

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -168,6 +157,92 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared
       assert(stats.simpleString == expectedString)
     }
   }
+
+  test("change stats after truncate command") {
+    val table = "change_stats_truncate_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
--- End diff --

`TruncateTableCommand` doesn't read the `AUTO_UPDATE_SIZE` config at all, so we don't need to test with this config.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124445015

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
     fileIndex.foreach(_.refresh())
     // refresh data cache if table is cached
     sparkSession.catalog.refreshByPath(outputPath.toString)
+
+    if (catalogTable.nonEmpty) {
--- End diff --

Do we need this `if`? We also do it in `CommandUtils.updateTableStats`.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124433756

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
     fileIndex.foreach(_.refresh())
     // refresh data cache if table is cached
     sparkSession.catalog.refreshByPath(outputPath.toString)
+
+    if (catalogTable.nonEmpty) {
+      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

We already added a flag to enable/disable the auto update; the flag is used inside `CommandUtils.updateTableStats` in order to reduce code redundancy.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124323569

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
     fileIndex.foreach(_.refresh())
     // refresh data cache if table is cached
     sparkSession.catalog.refreshByPath(outputPath.toString)
+
+    if (catalogTable.nonEmpty) {
+      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

> hive.stats.autogather
> Default Value: true
> Added In: Hive 0.7 with HIVE-1361
> A flag to gather statistics automatically during the INSERT OVERWRITE command.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124323435

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
     fileIndex.foreach(_.refresh())
     // refresh data cache if table is cached
     sparkSession.catalog.refreshByPath(outputPath.toString)
+
+    if (catalogTable.nonEmpty) {
+      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

This is not acceptable in the cloud environment.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r124323305

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
     fileIndex.foreach(_.refresh())
     // refresh data cache if table is cached
     sparkSession.catalog.refreshByPath(outputPath.toString)
+
+    if (catalogTable.nonEmpty) {
+      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

When the number of files is large, it is very slow.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r123998237

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,13 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed. " +
+        "Note that this update also removes other statistics (e.g. row count and column statistics)")
--- End diff --

This is misleading: even if they don't do the auto update, row count and column statistics will also be removed.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r123952350

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

Sure.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18334#discussion_r123944288

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

Then shall we update the config name? We need to mention that it only auto-updates the table size.
[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123937554

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala --- (same `updateTableStats` hunk as quoted above) --- End diff --

    Since updating size only needs to calculate file sizes, while updating other stats needs to scan the data, the two costs are in different orders of magnitude. How about adding another flag to auto update other stats in a separate pr or followup?
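The two-flag design floated here could be sketched roughly as follows. This is a hypothetical illustration of the cost split under discussion; the names `autoSizeUpdate` and `autoFullStatsUpdate` are illustrative and are not Spark's actual configuration keys:

```scala
// Hypothetical sketch of the two-flag design discussed above: a cheap flag that
// only re-sums file sizes, and a separate flag gating the expensive data scan.
// These names are illustrative; they are not Spark's real configuration keys.
case class StatsAutoUpdateConf(autoSizeUpdate: Boolean, autoFullStatsUpdate: Boolean)

sealed trait UpdateAction
case object NoUpdate extends UpdateAction
case object UpdateSizeOnly extends UpdateAction // sum file lengths, no data scan
case object UpdateAllStats extends UpdateAction // re-scan data for row count etc.

def chooseAction(conf: StatsAutoUpdateConf, hasExistingStats: Boolean): UpdateAction =
  if (!hasExistingStats) NoUpdate
  else if (conf.autoFullStatsUpdate) UpdateAllStats
  else if (conf.autoSizeUpdate) UpdateSizeOnly
  else NoUpdate
```

The point of the split is that the size-only path costs a file-system listing, while the full-stats path costs a scan of the table data, so users can opt into the cheap one without paying for the expensive one.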
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123935398

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala --- (same `updateTableStats` hunk as quoted above) --- End diff --

    In my initial design, I planned to update size by default and update other stats based on a flag. Auto updating all stats seems too aggressive to me with one flag...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123930251

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala --- (same `updateTableStats` hunk as quoted above) --- End diff --

    Since we are protected by a flag, can we be more aggressive and auto-update all stats?
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123929887

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
    @@ -165,6 +167,22 @@ private[sql] trait SQLTestUtils
       }
     
       /**
    +   * Creates the specified number of temporary directories, which is then passed to `f` and will be
    +   * deleted after `f` returns.
    +   */
    +  protected def withTempPaths(numPaths: Int)(f: Seq[File] => Unit): Unit = {
    +    val files = mutable.Buffer[File]()
    --- End diff --

    nit: we can just create an array as we know the size.
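A sketch of the reviewer's suggestion: since `numPaths` is known up front, `Array.fill` can pre-size the collection instead of appending to a `mutable.Buffer`. This standalone version substitutes `java.nio.file.Files` for Spark's internal `Utils.createTempDir`/`Utils.deleteRecursively` helpers, so the temp-dir plumbing here is illustrative:

```scala
import java.io.File
import java.nio.file.Files

// Reviewer's suggestion applied: Array.fill pre-sizes the collection because
// numPaths is known, instead of growing a mutable.Buffer.
// Standalone sketch: Spark's real version uses its internal Utils helpers
// for temp-dir creation and recursive deletion.
def withTempPaths(numPaths: Int)(f: Seq[File] => Unit): Unit = {
  val files = Array.fill(numPaths)(Files.createTempDirectory("spark-test").toFile)
  try f(files.toSeq) finally files.foreach(_.delete())
}
```

Note that `Array.fill` evaluates its element expression once per slot, so this still creates `numPaths` distinct directories.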
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123928827

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala --- (same `updateTableStats` hunk as quoted above) --- End diff --

    So we never auto-update column stats?
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r123891656

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -773,6 +773,12 @@ object SQLConf {
           .doubleConf
           .createWithDefault(0.05)
     
    +  val AUTO_STATS_UPDATE =
    +    buildConf("spark.sql.statistics.autoUpdate")
    +      .doc("Enables automatic statistics update once table's data is changed.")
    +      .booleanConf
    +      .createWithDefault(false)
    --- End diff --

    we can set it to true by default once we have incremental stats collection.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r122790775

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
    @@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
         fileIndex.foreach(_.refresh())
         // refresh data cache if table is cached
         sparkSession.catalog.refreshByPath(outputPath.toString)
    +
    +    if (catalogTable.nonEmpty) {
    +      CommandUtils.updateTableStats(sparkSession, catalogTable.get)
    --- End diff --

    It's difficult to get the size of `query` in an insert command. Besides, I personally prefer a simple implementation here, because usually the overhead of scanning files is negligible compared to the append operation.
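For context, the "overhead of scanning files" in this reply is a recursive file-size sum over the table location, not a scan of the data itself. A standalone sketch of that idea follows; Spark's actual size calculation walks the location with Hadoop's `FileSystem` API and skips temporary staging files, which this simplified version omits:

```scala
import java.io.File

// Simplified sketch: recursively sum file lengths under a table/partition
// location. Spark's real implementation uses org.apache.hadoop.fs.FileSystem
// and filters out staging files; this local-filesystem version is illustrative.
def calculateLocationSize(location: File): Long =
  if (location.isFile) location.length()
  else Option(location.listFiles()).toSeq.flatten.map(calculateLocationSize).sum
```

Because this only lists and stats files, its cost scales with the number of files, not the number of rows, which is why re-running it after an append is usually cheap relative to the write itself.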
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18334#discussion_r122600223

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala --- (same `updateTableStats` hunk as quoted above) --- End diff --

    For append mode, is it possible to just add the delta, instead of re-calculating the whole table?
GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/18334

    [SPARK-21127] [SQL] Update statistics after data changing commands

## What changes were proposed in this pull request?

Update stats after the following data changing commands:
- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand

## How was this patch tested?

Added new test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark changeStatsForOperation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18334.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18334

commit 9d4d97a272a74d99f79026648b0af72b4f5249ab
Author: Zhenhua Wang
Date: 2017-06-17T07:31:35Z

    update stats after commands