[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18309 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123890894 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,77 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPath { path => +sql(s"CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING)") + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") +partitionDates.foreach { ds => + sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') SELECT * FROM src") +} + +sql(s"ALTER TABLE $tableName SET LOCATION '$path'") + +sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan") + +assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { path => + // Create a table with 3 partitions all located under a single top-level directory 'path' + sql( +s""" + |CREATE TABLE $sourceTableName (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '$path' + """.stripMargin) + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.foreach { ds => + sql( +s""" + |INSERT INTO TABLE $sourceTableName PARTITION (ds='$ds') + |SELECT * FROM src + """.stripMargin) + } + + // Create another table referring to the same location + sql( +s""" + |CREATE TABLE $tableName (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '$path' + """.stripMargin) + + // Register only one of the partitions found on disk + val ds = partitionDates.head + sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds')").collect() + + // Analyze original table - expect 3 partitions + sql(s"ANALYZE TABLE $sourceTableName COMPUTE STATISTICS noscan") + assert(queryTotalSize(sourceTableName) === BigInt(3 * 5812)) + + // Analyze partial-copy table - expect only 1 partition + sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan") + assert(queryTotalSize(tableName) === BigInt(5812)) --- End diff -- I am afraid this hard-coded values might not succeed in some other platforms. We might need to adjust the way. However, let me first merge this. If needed, we might need to submit a follow-up PR. cc @cloud-fan @wzhfy --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123242117 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- Ok. I'll simplify the test and remove withTempPaths function. Since I need to make sure that partition locations are not under table location, I'll include ALTER TABLE $tableName SET LOCATION '$path' after creating table with partitions in the default location. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163236 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,93 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + // Create 4 paths: one to use as table location and one for each of the 3 partitions + withTempPaths(numPaths = 4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""".stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { +path => + // Create a table with 3 partitions all located under a single top-level directory 'path' + sql( +s""" + |CREATE TABLE ${sourceTableName} (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '${path}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.foreach { +ds => + sql( +s""" + |INSERT INTO TABLE ${sourceTableName} PARTITION (ds='${ds}') + |SELECT * FROM src + """.stripMargin).collect() + } + + // Create another table referring to the same location + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '${path}' + """.stripMargin).collect() --- End diff -- The same here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163189 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,93 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + // Create 4 paths: one to use as table location and one for each of the 3 partitions + withTempPaths(numPaths = 4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""".stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { +path => + // Create a table with 3 partitions all located under a single top-level directory 'path' + sql( +s""" + |CREATE TABLE ${sourceTableName} (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '${path}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.foreach { +ds => + sql( +s""" + |INSERT INTO TABLE ${sourceTableName} PARTITION (ds='${ds}') --- End diff -- `${sourceTableName}` -> `$sourceTableName` `${ds}` -> `$ds` Please fix the other similar issues --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163221 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,93 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + // Create 4 paths: one to use as table location and one for each of the 3 partitions + withTempPaths(numPaths = 4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""".stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { +path => + // Create a table with 3 partitions all located under a single top-level directory 'path' + sql( +s""" + |CREATE TABLE ${sourceTableName} (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '${path}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.foreach { +ds => + sql( +s""" + |INSERT INTO TABLE ${sourceTableName} PARTITION (ds='${ds}') + |SELECT * FROM src + """.stripMargin).collect() --- End diff -- No need to call `collect()` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163104 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,93 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + // Create 4 paths: one to use as table location and one for each of the 3 partitions + withTempPaths(numPaths = 4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""".stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { +path => + // Create a table with 3 partitions all located under a single top-level directory 'path' + sql( +s""" + |CREATE TABLE ${sourceTableName} (key STRING, value STRING) + |PARTITIONED BY (ds STRING) + |LOCATION '${path}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.foreach { +ds => --- End diff -- combine line 187 and 188 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163000 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- We do not need such complex test cases for verifying the fix by this PR. In the future, if needed, we can add this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r123163048 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,93 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + // Create 4 paths: one to use as table location and one for each of the 3 partitions + withTempPaths(numPaths = 4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """.stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""".stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) + } +} + } + + test("SPARK-21079 - analyze partitioned table with only a subset of partitions visible") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val sourceTableName = "analyzeTable_part" +val tableName = "analyzeTable_part_vis" +withTable(sourceTableName, tableName) { + withTempPath { +path => --- End diff -- Nit: combine line 176 and 177 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122651779 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- This is how I wrote the test initially. In this case all partitions are located under the same top-level directory, but table-level location is somewhere else. I modified the test to use different paths for each partition as well as the table to address some of the earlier comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122649017 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils } /** + * Creates the requested number of temporary path (without creating the actual file/directory), --- End diff -- I wanted to create a multi-path version of an existing withTempPath function. That function returns a *valid*, but non-existent path. The comments and path.delete() part of the implementation came from there. /** * Generates a temporary path without creating the actual file/directory, then pass it to `f`. If * a file/directory is created there by `f`, it will be delete after `f` returns. * * @todo Probably this method should be moved to a more general place */ protected def withTempPath(f: File => Unit): Unit = { val path = Utils.createTempDir() path.delete() try f(path) finally Utils.deleteRecursively(path) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122600956 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- After the test case simplification, the new utility function `withTempPaths` is not needed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122600944 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } + + sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan") + + assert(queryTotalSize(tableName) === BigInt(17436)) --- End diff -- It can be simplified to ```Scala val tableName = "analyzeTable_part" withTable(tableName) { withTempPath { path => sql( s""" |CREATE TABLE $tableName (key STRING, value STRING) PARTITIONED BY (ds STRING) |LOCATION '$path' """. stripMargin) val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") partitionDates.foreach { ds => sql( s""" |INSERT INTO TABLE $tableName PARTITION (ds='$ds') |SELECT * FROM src """.stripMargin) } sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan") assert( spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes === BigInt(17436)) } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122600822 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { --- End diff -- `numPaths` -> `numPaths = 4` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122600813 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils } /** + * Creates the requested number of temporary path (without creating the actual file/directory), + * which are then passed to f and will be deleted after f returns. + * + * @param num Number of directories to create --- End diff -- `num` -> `numPaths` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122599863 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -81,6 +83,21 @@ case class AnalyzeTableCommand( object AnalyzeTableCommand extends Logging { def calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): Long = { +if (catalogTable.partitionColumnNames.isEmpty) { + calculateLocationSize(sessionState, catalogTable.identifier, catalogTable.storage.locationUri) +} else { + // Calculate table size as a sum of its partitions. See SPARK-21079 --- End diff -- `its partitions` -> `the visible partitions` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122581751 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils } /** + * Creates the requested number of temporary path (without creating the actual file/directory), + * which are then passed to f and will be deleted after f returns. + * + * @param num Number of directories to create + * @param f Function to invoke with the created paths + */ + protected def withTempPaths(num: Int)(f: List[File] => Unit) { +val paths = mutable.ListBuffer[File]() +for (i <- 0 until num) { + val path = Utils.createTempDir().getCanonicalFile + path.delete() --- End diff -- why do we need to delete here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122581748 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils } /** + * Creates the requested number of temporary path (without creating the actual file/directory), --- End diff -- > without creating the actual file/directory what do you mean by "actual file"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122581755 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { --- End diff -- add a line of comment to explain the number 4 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122581756 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,45 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +val tableName = "analyzeTable_part" +withTable(tableName) { + withTempPaths(4) { +case tablePath :: partitionPaths => + sql( +s""" + |CREATE TABLE ${tableName} (key STRING, value STRING) PARTITIONED BY (ds STRING) + |LOCATION '${tablePath}' + """. + stripMargin).collect() + + val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") + partitionDates.zip(partitionPaths).foreach { +case (ds, path) => + sql( +s""" + |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""". + stripMargin).collect() --- End diff -- nit: """.stripMargin) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122505126 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { --- End diff -- I need a path for the table and one more path for each of the three partitions. 1 + 3 = 4. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122499294 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { + case tablePath :: partitionPaths => +sql( + s""" +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) +|LOCATION '${tablePath}' + """.stripMargin).collect() + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") +partitionDates.zip(partitionPaths).foreach(p => { + val ds = p._1 + val path = p._2 + sql( +s""" + |ALTER TABLE analyzeTable_part ADD PARTITION (ds='${ds}') + |LOCATION '${path.toString}' +""".stripMargin).collect() + sql( +s""" + |INSERT INTO TABLE analyzeTable_part PARTITION (ds='${ds}') + |SELECT * FROM src +""". + stripMargin).collect() + } +) + +// Modify table location to not match location of individual partitions +sql("ALTER TABLE analyzeTable_part SET LOCATION 'file:/do/not/use'").collect() + +sql("ANALYZE TABLE analyzeTable_part COMPUTE STATISTICS noscan") + +assert(queryTotalSize("analyzeTable_part") === BigInt(17436)) + +sql("DROP TABLE analyzeTable_part").collect() --- End diff -- use `withTable` to include the test, then we don't need to drop table explicitly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122497283 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +126,35 @@ private[sql] trait SQLTestUtils } /** + * Generates the requested number of temporary paths without creating the + * actual files/directories and passes these to the provided function 'f'. + * Deletes any files/directories left at these paths after 'f' returns. --- End diff -- Creates the requested number of temporary directories, which are then passed to `f` and will be deleted after `f` returns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122498969 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { --- End diff -- 4 -> 3? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122498459 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { + case tablePath :: partitionPaths => --- End diff -- I think `withTempPaths(4) { paths =>` is ok, the pattern match seems unnecessary? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122498658 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +129,48 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +withTempPaths(4) { (paths) => paths match { + case tablePath :: partitionPaths => +sql( + s""" +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) +|LOCATION '${tablePath}' + """.stripMargin).collect() + +val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03") +partitionDates.zip(partitionPaths).foreach(p => { --- End diff -- partitionDates.zip(partitionPaths).foreach { case (ds, path) => --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122497071 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala --- @@ -126,6 +126,35 @@ private[sql] trait SQLTestUtils } /** + * Generates the requested number of temporary paths without creating the + * actual files/directories and passes these to the provided function 'f'. + * Deletes any files/directories left at these paths after 'f' returns. + * + * @param numPaths Number of paths to create + * @param f Function to invoke with the created paths + */ + protected def withTempPaths(numPaths: Int)(f: List[File] => Unit) { +def addPaths(numPaths: Int, paths: List[File]): List[File] = { + + if (numPaths <= 0) { +paths + } else { +val path = Utils.createTempDir() +path.delete() +addPaths(numPaths - 1, path :: paths) + } +} + +val paths = addPaths(numPaths, List.empty) + +try { + f(paths) +} finally { + paths.foreach(path => Utils.deleteRecursively(path)) +} + } --- End diff -- I personally think It's strange to use a recursive function, does the following modified function work for your test? ``` protected def withTempDirs(num: Int)(body: Seq[File] => Unit) { val files = mutable.Buffer[File]() for (i <- 0 until num) { files += Utils.createTempDir().getCanonicalFile } try { body(files) } finally { files.foreach(Utils.deleteRecursively) } } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122275444 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-03') +|SELECT * FROM src + """.stripMargin).collect() + +// Modify table location to not match location of individual partitions +sql("ALTER TABLE analyzeTable_part SET LOCATION 'file:/do/not/use'").collect() --- End diff -- BTW, we need to move `withTempDirs` from `FileStreamSourceSuite` to `SQLTestUtils`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122268037 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-03') +|SELECT * FROM src + """.stripMargin).collect() + +// Modify table location to not match location of individual partitions +sql("ALTER TABLE analyzeTable_part SET LOCATION 'file:/do/not/use'").collect() --- End diff -- please use `withTempDirs` to get several tmp dirs (deletion is guaranteed in `withTempDirs`) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122267467 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( --- End diff -- I agree with @tejasapatil . The test case would be more robust in that way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user mbasmanova commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r14357 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( --- End diff -- This is a fine suggestion, but I feel it falls outside of the scope of this change. This particular fix concerns only the case when table-level location is not a parent directory of a partition-level location. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122217829 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -109,16 +124,16 @@ object AnalyzeTableCommand extends Logging { size } -catalogTable.storage.locationUri.map { p => +locationUri.map { p => val path = new Path(p) try { val fs = path.getFileSystem(sessionState.newHadoopConf()) calculateTableSize(fs, path) --- End diff -- looks like this comment got missed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122217717 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -128,6 +128,40 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = false) } + test("SPARK-21079 - analyze table with location different than that of individual partitions") { +def queryTotalSize(tableName: String): BigInt = + spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes + +sql( + """ +|CREATE TABLE analyzeTable_part (key STRING, value STRING) PARTITIONED BY (ds STRING) + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01') +|SELECT * FROM src + """.stripMargin).collect() +sql( + """ +|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02') +|SELECT * FROM src + """.stripMargin).collect() +sql( --- End diff -- All the three partitions point reside under the same base dir. In reality it might be possible that all three are pointing to different locations. You could simulate that by creating empty partition (`ALTER TABLE ... ADD PARTITION`), then doing a `ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location"` and finally inserting data to the partition. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122116932 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -81,6 +83,19 @@ case class AnalyzeTableCommand( object AnalyzeTableCommand extends Logging { def calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): Long = { +if (catalogTable.partitionColumnNames.isEmpty) { + calculateTotalSize(sessionState, catalogTable.identifier, catalogTable.storage.locationUri) --- End diff -- rename `calculateLocationSize`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122102333 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -81,6 +83,19 @@ case class AnalyzeTableCommand( object AnalyzeTableCommand extends Logging { def calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): Long = { +if (catalogTable.partitionColumnNames.isEmpty) { + calculateTotalSize(sessionState, catalogTable.identifier, catalogTable.storage.locationUri) +} else { + // Table = Sum(partitions) + val partitions = sessionState.catalog.listPartitions(catalogTable.identifier) + partitions.map(p => +calculateTotalSize(sessionState, catalogTable.identifier, p.storage.locationUri) + ).sum +} + } + + private def calculateTotalSize(sessionState: SessionState, tableId: TableIdentifier, --- End diff -- nit: if the number of args exceeds line length, the code style is to have individual args on separate line with 4 spaces. ``` private def calculateTotalSize( sessionState: SessionState, tableId: TableIdentifier, locationUri: Option[URI]): Long = { // This method is mainly based on ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122102110 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -81,6 +83,19 @@ case class AnalyzeTableCommand( object AnalyzeTableCommand extends Logging { def calculateTotalSize(sessionState: SessionState, catalogTable: CatalogTable): Long = { +if (catalogTable.partitionColumnNames.isEmpty) { + calculateTotalSize(sessionState, catalogTable.identifier, catalogTable.storage.locationUri) +} else { + // Table = Sum(partitions) --- End diff -- nit : you could explain it instead of putting out the formula --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/18309#discussion_r122102041 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -109,16 +124,16 @@ object AnalyzeTableCommand extends Logging { size } -catalogTable.storage.locationUri.map { p => +locationUri.map { p => val path = new Path(p) try { val fs = path.getFileSystem(sessionState.newHadoopConf()) calculateTableSize(fs, path) --- End diff -- given that after this change this would be either a table / partition, we should rename the method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...
GitHub user mbasmanova opened a pull request: https://github.com/apache/spark/pull/18309 [SPARK-21079] [SQL] Calculate total size of a partition table as a sum of individual partitions ## What changes were proposed in this pull request? When calculating total size of a partitioned table, use storage URIs associated with individual partitions to identify the files which make up the table. CC: @wzhfy ## How was this patch tested? Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-part-table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18309.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18309 commit a1dbdd6f56e500586b399565a7f837800039bfb3 Author: Masha BasmanovaDate: 2017-06-15T00:24:47Z [SPARK-21079] [SQL] Calculate total size of a partition table as a sum of individual partitions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org