subject:"\[GitHub\] spark pull request #18309\: \[SPARK\-21079\] \[SQL\] Calculate total size of a par..."

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-24 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18309


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-24 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123890894
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,77 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPath { path =>
+sql(s"CREATE TABLE $tableName (key STRING, value STRING) 
PARTITIONED BY (ds STRING)")
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+partitionDates.foreach { ds =>
+  sql(s"INSERT INTO TABLE $tableName PARTITION (ds='$ds') SELECT * 
FROM src")
+}
+
+sql(s"ALTER TABLE $tableName SET LOCATION '$path'")
+
+sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan")
+
+assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath { path =>
+  // Create a table with 3 partitions all located under a single 
top-level directory 'path'
+  sql(
+s"""
+   |CREATE TABLE $sourceTableName (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '$path'
+ """.stripMargin)
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.foreach { ds =>
+  sql(
+s"""
+   |INSERT INTO TABLE $sourceTableName PARTITION (ds='$ds')
+   |SELECT * FROM src
+ """.stripMargin)
+  }
+
+  // Create another table referring to the same location
+  sql(
+s"""
+   |CREATE TABLE $tableName (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '$path'
+ """.stripMargin)
+
+  // Register only one of the partitions found on disk
+  val ds = partitionDates.head
+  sql(s"ALTER TABLE $tableName ADD PARTITION (ds='$ds')").collect()
+
+  // Analyze original table - expect 3 partitions
+  sql(s"ANALYZE TABLE $sourceTableName COMPUTE STATISTICS noscan")
+  assert(queryTotalSize(sourceTableName) === BigInt(3 * 5812))
+
+  // Analyze partial-copy table - expect only 1 partition
+  sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan")
+  assert(queryTotalSize(tableName) === BigInt(5812))
--- End diff --

I am afraid this hard-coded values might not succeed in some other 
platforms. We might need to adjust the way. However, let me first merge this. 
If needed, we might need to submit a follow-up PR. cc @cloud-fan @wzhfy 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread mbasmanova

Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123242117
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

Ok. I'll simplify the test and remove withTempPaths function. Since I need 
to make sure that partition locations are not under table location, I'll 
include ALTER TABLE $tableName SET LOCATION '$path' after creating table with 
partitions in the default location. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163236
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,93 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  // Create 4 paths: one to use as table location and one for each of 
the 3 partitions
+  withTempPaths(numPaths = 4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath {
+path =>
+  // Create a table with 3 partitions all located under a single 
top-level directory 'path'
+  sql(
+s"""
+   |CREATE TABLE ${sourceTableName} (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '${path}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.foreach {
+ds =>
+  sql(
+s"""
+   |INSERT INTO TABLE ${sourceTableName} PARTITION 
(ds='${ds}')
+   |SELECT * FROM src
+ """.stripMargin).collect()
+  }
+
+  // Create another table referring to the same location
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '${path}'
+ """.stripMargin).collect()
--- End diff --

The same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163189
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,93 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  // Create 4 paths: one to use as table location and one for each of 
the 3 partitions
+  withTempPaths(numPaths = 4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath {
+path =>
+  // Create a table with 3 partitions all located under a single 
top-level directory 'path'
+  sql(
+s"""
+   |CREATE TABLE ${sourceTableName} (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '${path}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.foreach {
+ds =>
+  sql(
+s"""
+   |INSERT INTO TABLE ${sourceTableName} PARTITION 
(ds='${ds}')
--- End diff --

`${sourceTableName}` -> `$sourceTableName`
`${ds}` -> `$ds`

Please fix the other similar issues


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163221
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,93 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  // Create 4 paths: one to use as table location and one for each of 
the 3 partitions
+  withTempPaths(numPaths = 4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath {
+path =>
+  // Create a table with 3 partitions all located under a single 
top-level directory 'path'
+  sql(
+s"""
+   |CREATE TABLE ${sourceTableName} (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '${path}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.foreach {
+ds =>
+  sql(
+s"""
+   |INSERT INTO TABLE ${sourceTableName} PARTITION 
(ds='${ds}')
+   |SELECT * FROM src
+ """.stripMargin).collect()
--- End diff --

No need to call `collect()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163104
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,93 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  // Create 4 paths: one to use as table location and one for each of 
the 3 partitions
+  withTempPaths(numPaths = 4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath {
+path =>
+  // Create a table with 3 partitions all located under a single 
top-level directory 'path'
+  sql(
+s"""
+   |CREATE TABLE ${sourceTableName} (key STRING, value STRING)
+   |PARTITIONED BY (ds STRING)
+   |LOCATION '${path}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.foreach {
+ds =>
--- End diff --

combine line 187 and 188


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163000
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

We do not need such complex test cases for verifying the fix by this PR. In 
the future, if needed, we can add this


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-21 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r123163048
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,93 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  // Create 4 paths: one to use as table location and one for each of 
the 3 partitions
+  withTempPaths(numPaths = 4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
+  }
+}
+  }
+
+  test("SPARK-21079 - analyze partitioned table with only a subset of 
partitions visible") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val sourceTableName = "analyzeTable_part"
+val tableName = "analyzeTable_part_vis"
+withTable(sourceTableName, tableName) {
+  withTempPath {
+path =>
--- End diff --

Nit: combine line 176 and 177


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-19 Thread mbasmanova

Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122651779
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

This is how I wrote the test initially. In this case all partitions are 
located under the same top-level directory, but table-level location is 
somewhere else. I modified the test to use different paths for each partition 
as well as the table to address some of the earlier comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-19 Thread mbasmanova

Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122649017
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Creates the requested number of temporary path (without creating the 
actual file/directory),
--- End diff --

I wanted to create a multi-path version of an existing withTempPath 
function. That function returns a *valid*, but non-existent path. The comments 
and path.delete() part of the implementation came from there.

/**
   * Generates a temporary path without creating the actual file/directory, 
then pass it to `f`. If
   * a file/directory is created there by `f`, it will be delete after `f` 
returns.
   *
   * @todo Probably this method should be moved to a more general place
   */
  protected def withTempPath(f: File => Unit): Unit = {
val path = Utils.createTempDir()
path.delete()
try f(path) finally Utils.deleteRecursively(path)
  }


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122600956
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

After the test case simplification, the new utility function 
`withTempPaths` is not needed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122600944
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
+  sql(
+s"""
+   |INSERT INTO TABLE ${tableName} PARTITION (ds='${ds}')
+   |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+
+  sql(s"ANALYZE TABLE ${tableName} COMPUTE STATISTICS noscan")
+
+  assert(queryTotalSize(tableName) === BigInt(17436))
--- End diff --

It can be simplified to 
```Scala
val tableName = "analyzeTable_part"
withTable(tableName) {
  withTempPath { path =>
sql(
  s"""
 |CREATE TABLE $tableName (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
 |LOCATION '$path'
   """.
stripMargin)

val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
partitionDates.foreach { ds =>
  sql(
s"""
   |INSERT INTO TABLE $tableName PARTITION (ds='$ds')
   |SELECT * FROM src
 """.stripMargin)
}

sql(s"ANALYZE TABLE $tableName COMPUTE STATISTICS noscan")

assert(
  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes === 
BigInt(17436))
  }
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122600822
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
--- End diff --

`numPaths` -> `numPaths  = 4`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122600813
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Creates the requested number of temporary path (without creating the 
actual file/directory),
+   * which are then passed to f and will be deleted after f returns.
+   *
+   * @param num Number of directories to create
--- End diff --

`num` -> `numPaths`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-18 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122599863
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -81,6 +83,21 @@ case class AnalyzeTableCommand(
 object AnalyzeTableCommand extends Logging {
 
   def calculateTotalSize(sessionState: SessionState, catalogTable: 
CatalogTable): Long = {
+if (catalogTable.partitionColumnNames.isEmpty) {
+  calculateLocationSize(sessionState, catalogTable.identifier, 
catalogTable.storage.locationUri)
+} else {
+  // Calculate table size as a sum of its partitions. See SPARK-21079
--- End diff --

`its partitions` -> `the visible partitions`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-17 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122581751
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Creates the requested number of temporary path (without creating the 
actual file/directory),
+   * which are then passed to f and will be deleted after f returns.
+   *
+   * @param num Number of directories to create
+   * @param f Function to invoke with the created paths
+   */
+  protected def withTempPaths(num: Int)(f: List[File] => Unit) {
+val paths = mutable.ListBuffer[File]()
+for (i <- 0 until num) {
+  val path = Utils.createTempDir().getCanonicalFile
+  path.delete()
--- End diff --

why do we need to delete here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-17 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122581748
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +127,27 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Creates the requested number of temporary path (without creating the 
actual file/directory),
--- End diff --

> without creating the actual file/directory

what do you mean by "actual file"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-17 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122581755
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
--- End diff --

add a line of comment to explain the number 4


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-17 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122581756
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,45 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+val tableName = "analyzeTable_part"
+withTable(tableName) {
+  withTempPaths(4) {
+case tablePath :: partitionPaths =>
+  sql(
+s"""
+   |CREATE TABLE ${tableName} (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+   |LOCATION '${tablePath}'
+ """.
+  stripMargin).collect()
+
+  val partitionDates = List("2010-01-01", "2010-01-02", 
"2010-01-03")
+  partitionDates.zip(partitionPaths).foreach {
+case (ds, path) =>
+  sql(
+s"""
+   |ALTER TABLE ${tableName} ADD PARTITION (ds='${ds}')
+   |LOCATION '${path.toString}'
+""".
+  stripMargin).collect()
--- End diff --

nit: """.stripMargin)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread mbasmanova

Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122505126
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+withTempPaths(4) { (paths) => paths match {
--- End diff --

I need a path for the table and one more path for each of the three 
partitions. 1 + 3 = 4.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122499294
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+withTempPaths(4) { (paths) => paths match {
+  case tablePath :: partitionPaths =>
+sql(
+  s"""
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+|LOCATION '${tablePath}'
+  """.stripMargin).collect()
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+partitionDates.zip(partitionPaths).foreach(p => {
+  val ds = p._1
+  val path = p._2
+  sql(
+s"""
+  |ALTER TABLE analyzeTable_part ADD PARTITION (ds='${ds}')
+  |LOCATION '${path.toString}'
+""".stripMargin).collect()
+  sql(
+s"""
+  |INSERT INTO TABLE analyzeTable_part PARTITION (ds='${ds}')
+  |SELECT * FROM src
+""".
+  stripMargin).collect()
+  }
+)
+
+// Modify table location to not match location of individual 
partitions
+sql("ALTER TABLE analyzeTable_part SET LOCATION 
'file:/do/not/use'").collect()
+
+sql("ANALYZE TABLE analyzeTable_part COMPUTE STATISTICS noscan")
+
+assert(queryTotalSize("analyzeTable_part") === BigInt(17436))
+
+sql("DROP TABLE analyzeTable_part").collect()
--- End diff --

use `withTable` to include the test, then we don't need to drop table 
explicitly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122497283
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +126,35 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Generates the requested number of temporary paths without creating the
+   * actual files/directories and passes these to the provided function 
'f'.
+   * Deletes any files/directories left at these paths after 'f' returns.
--- End diff --

Creates the requested number of temporary directories, which are then 
passed to `f` and will be deleted after `f` returns.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122498969
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+withTempPaths(4) { (paths) => paths match {
--- End diff --

4 -> 3?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122498459
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+withTempPaths(4) { (paths) => paths match {
+  case tablePath :: partitionPaths =>
--- End diff --

I think `withTempPaths(4) { paths =>` is ok, the pattern match seems 
unnecessary?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122498658
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +129,48 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+withTempPaths(4) { (paths) => paths match {
+  case tablePath :: partitionPaths =>
+sql(
+  s"""
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+|LOCATION '${tablePath}'
+  """.stripMargin).collect()
+
+val partitionDates = List("2010-01-01", "2010-01-02", "2010-01-03")
+partitionDates.zip(partitionPaths).foreach(p => {
--- End diff --

partitionDates.zip(partitionPaths).foreach { case (ds, path) =>


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-16 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122497071
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -126,6 +126,35 @@ private[sql] trait SQLTestUtils
   }
 
   /**
+   * Generates the requested number of temporary paths without creating the
+   * actual files/directories and passes these to the provided function 
'f'.
+   * Deletes any files/directories left at these paths after 'f' returns.
+   *
+   * @param numPaths Number of paths to create
+   * @param f Function to invoke with the created paths
+   */
+  protected def withTempPaths(numPaths: Int)(f: List[File] => Unit) {
+def addPaths(numPaths: Int, paths: List[File]): List[File] = {
+
+  if (numPaths <= 0) {
+paths
+  } else {
+val path = Utils.createTempDir()
+path.delete()
+addPaths(numPaths - 1, path :: paths)
+  }
+}
+
+val paths = addPaths(numPaths, List.empty)
+
+try {
+  f(paths)
+} finally {
+  paths.foreach(path => Utils.deleteRecursively(path))
+}
+  }
--- End diff --

I personally think It's strange to use a recursive function, does the 
following modified function work for your test?
```

  protected def withTempDirs(num: Int)(body: Seq[File] => Unit) {
val files = mutable.Buffer[File]()
for (i <- 0 until num) {
  files += Utils.createTempDir().getCanonicalFile
}
try {
  body(files)
} finally {
  files.foreach(Utils.deleteRecursively)
}
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122275444
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+sql(
+  """
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-03')
+|SELECT * FROM src
+  """.stripMargin).collect()
+
+// Modify table location to not match location of individual partitions
+sql("ALTER TABLE analyzeTable_part SET LOCATION 
'file:/do/not/use'").collect()
--- End diff --

BTW, we need to move `withTempDirs` from `FileStreamSourceSuite` to 
`SQLTestUtils`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122268037
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+sql(
+  """
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-03')
+|SELECT * FROM src
+  """.stripMargin).collect()
+
+// Modify table location to not match location of individual partitions
+sql("ALTER TABLE analyzeTable_part SET LOCATION 
'file:/do/not/use'").collect()
--- End diff --

please use `withTempDirs` to get several tmp dirs (deletion is guaranteed 
in `withTempDirs`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122267467
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+sql(
+  """
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
--- End diff --

I agree with @tejasapatil . The test case would be more robust in that way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread mbasmanova

Github user mbasmanova commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r14357
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+sql(
+  """
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
--- End diff --

This is a fine suggestion, but I feel it falls outside of the scope of this 
change. This particular fix concerns only the case when table-level location is 
not a parent directory of a partition-level location.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread tejasapatil

Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122217829
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -109,16 +124,16 @@ object AnalyzeTableCommand extends Logging {
   size
 }
 
-catalogTable.storage.locationUri.map { p =>
+locationUri.map { p =>
   val path = new Path(p)
   try {
 val fs = path.getFileSystem(sessionState.newHadoopConf())
 calculateTableSize(fs, path)
--- End diff --

looks like this comment got missed 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-15 Thread tejasapatil

Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122217717
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -128,6 +128,40 @@ class StatisticsSuite extends 
StatisticsCollectionTestBase with TestHiveSingleto
   TableIdentifier("tempTable"), ignoreIfNotExists = true, purge = 
false)
   }
 
+  test("SPARK-21079 - analyze table with location different than that of 
individual partitions") {
+def queryTotalSize(tableName: String): BigInt =
+  
spark.table(tableName).queryExecution.analyzed.stats(conf).sizeInBytes
+
+sql(
+  """
+|CREATE TABLE analyzeTable_part (key STRING, value STRING) 
PARTITIONED BY (ds STRING)
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-01')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
+  """
+|INSERT INTO TABLE analyzeTable_part PARTITION (ds='2010-01-02')
+|SELECT * FROM src
+  """.stripMargin).collect()
+sql(
--- End diff --

All the three partitions point reside under the same base dir. In reality 
it might be possible that all three are pointing to different locations. You 
could simulate that by creating empty partition (`ALTER TABLE ... ADD 
PARTITION`), then doing a `ALTER TABLE table_name [PARTITION partition_spec] 
SET LOCATION "new location"` and finally inserting data to the partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread wzhfy

Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122116932
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -81,6 +83,19 @@ case class AnalyzeTableCommand(
 object AnalyzeTableCommand extends Logging {
 
   def calculateTotalSize(sessionState: SessionState, catalogTable: 
CatalogTable): Long = {
+if (catalogTable.partitionColumnNames.isEmpty) {
+  calculateTotalSize(sessionState, catalogTable.identifier, 
catalogTable.storage.locationUri)
--- End diff --

rename `calculateLocationSize`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread tejasapatil

Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122102333
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -81,6 +83,19 @@ case class AnalyzeTableCommand(
 object AnalyzeTableCommand extends Logging {
 
   def calculateTotalSize(sessionState: SessionState, catalogTable: 
CatalogTable): Long = {
+if (catalogTable.partitionColumnNames.isEmpty) {
+  calculateTotalSize(sessionState, catalogTable.identifier, 
catalogTable.storage.locationUri)
+} else {
+  // Table = Sum(partitions)
+  val partitions = 
sessionState.catalog.listPartitions(catalogTable.identifier)
+  partitions.map(p =>
+calculateTotalSize(sessionState, catalogTable.identifier, 
p.storage.locationUri)
+  ).sum
+}
+  }
+
+  private def calculateTotalSize(sessionState: SessionState, tableId: 
TableIdentifier,
--- End diff --

nit: if the number of args exceeds line length, the code style is to have 
individual args on separate line with 4 spaces.

```
  private def calculateTotalSize(
  sessionState: SessionState,
  tableId: TableIdentifier,
  locationUri: Option[URI]): Long = {
// This method is mainly based on

```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread tejasapatil

Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122102110
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -81,6 +83,19 @@ case class AnalyzeTableCommand(
 object AnalyzeTableCommand extends Logging {
 
   def calculateTotalSize(sessionState: SessionState, catalogTable: 
CatalogTable): Long = {
+if (catalogTable.partitionColumnNames.isEmpty) {
+  calculateTotalSize(sessionState, catalogTable.identifier, 
catalogTable.storage.locationUri)
+} else {
+  // Table = Sum(partitions)
--- End diff --

nit : you could explain it instead of putting out the formula


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread tejasapatil

Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/18309#discussion_r122102041
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala
 ---
@@ -109,16 +124,16 @@ object AnalyzeTableCommand extends Logging {
   size
 }
 
-catalogTable.storage.locationUri.map { p =>
+locationUri.map { p =>
   val path = new Path(p)
   try {
 val fs = path.getFileSystem(sessionState.newHadoopConf())
 calculateTableSize(fs, path)
--- End diff --

given that after this change this would be either a table / partition, we 
should rename the method


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18309: [SPARK-21079] [SQL] Calculate total size of a par...

2017-06-14 Thread mbasmanova

GitHub user mbasmanova opened a pull request:

https://github.com/apache/spark/pull/18309

[SPARK-21079] [SQL] Calculate total size of a partition table as a sum of 
individual partitions

## What changes were proposed in this pull request?

When calculating total size of a partitioned table, use storage URIs 
associated with individual partitions to identify the files which make up the 
table.

CC: @wzhfy

## How was this patch tested?

Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and 
verified that sizeInBytes is calculated correctly. Before this change, the size 
would be zero.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mbasmanova/spark mbasmanova-analyze-part-table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18309


commit a1dbdd6f56e500586b399565a7f837800039bfb3
Author: Masha Basmanova 
Date:   2017-06-15T00:24:47Z

[SPARK-21079] [SQL] Calculate total size of a partition table as a sum of 
individual partitions




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

38 matches

Mail list logo