[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r125154201
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -97,6 +106,10 @@ object CommandUtils extends Logging {
          0L
        }
      }.getOrElse(0L)
+    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+    logInfo(s"It took $durationInMs ms to calculate the total file size under path $locationUri.")
--- End diff --

Actually, the log message already contains a timestamp, so we don't strictly need to calculate the total time, but I think it is fine here.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18334





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124977705
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
 
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

INFO should be fine, because there should not be many such messages, right?





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124977390
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,12 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
 
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed.")
--- End diff --

Yeah, we should clearly document the impact of this flag. Thanks for pointing this out.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124977456
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
 
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

OK, shall we log at INFO level or DEBUG level?





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124971615
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,12 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
 
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed.")
--- End diff --

This flag could slow down all the data-changing commands, so we need to clearly explain the potential performance regression. Based on the current description, every user would be willing to turn it on. Normally, users expect incremental updates rather than a full recalculation.
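
One possible wording that spells out the cost (a sketch of the doc text, not the wording that finally landed):

```scala
val AUTO_UPDATE_SIZE =
  buildConf("spark.sql.statistics.autoUpdate.size")
    .doc("Enables automatic update for table size once table's data is changed. " +
      "Note that the new size is computed by listing all the files under the table " +
      "location, so enabling this can significantly slow down data-changing commands " +
      "on tables with many files; it is not an incremental update.")
    .booleanConf
    .createWithDefault(false)
```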





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-29 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124971431
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -437,7 +437,20 @@ case class AlterTableAddPartitionCommand(
     }
     catalog.createPartitions(table.identifier, parts, ignoreIfExists = ifNotExists)
 
-    CommandUtils.updateTableStats(sparkSession, table)
+    if (table.stats.nonEmpty) {
+      if (sparkSession.sessionState.conf.autoUpdateSize) {
+        val addedSize = parts.map { part =>
+          CommandUtils.calculateLocationSize(sparkSession.sessionState, table.identifier,
--- End diff --

In the function `calculateLocationSize`, please add log messages when starting/finishing the statistics collection, and how long it takes.
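
A minimal sketch of the requested logging, assuming the `calculateLocationSize` signature quoted in this thread and using Hadoop's `getContentSummary` as a stand-in for the existing recursive size walk:

```scala
import java.net.URI

import org.apache.hadoop.fs.Path

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.internal.SessionState

object CalculateLocationSizeSketch extends Logging {
  def calculateLocationSize(
      sessionState: SessionState,
      identifier: TableIdentifier, // kept for parity with the real signature
      locationUri: Option[URI]): Long = {
    logInfo(s"Starting to calculate the total file size under path $locationUri.")
    val startTime = System.nanoTime()
    val size = locationUri.map { uri =>
      val path = new Path(uri)
      val fs = path.getFileSystem(sessionState.newHadoopConf())
      // Simplified: the real implementation walks the tree and skips staging files.
      fs.getContentSummary(path).getLength
    }.getOrElse(0L)
    val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
    logInfo(s"It took $durationInMs ms to calculate the total file size under path $locationUri.")
    size
  }
}
```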





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-28 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124469169
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
+   */
+  private def getStatsProperties(tableName: String): Map[String, String] = {
+    val hTable = hiveClient.getTable(spark.sessionState.catalog.getCurrentDatabase, tableName)
+    hTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))
+  }
+
+  test("change stats after insert command for hive table") {
+    val table = s"change_stats_insert_hive_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i int, j string)")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          // insert into command
+          sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
+          if (autoUpdate) {
+            val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+            assert(fetched2.get.sizeInBytes > 0)
+            assert(fetched2.get.colStats.isEmpty)
+            val statsProp = getStatsProperties(table)
+            assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+          } else {
+            assert(getStatsProperties(table).isEmpty)
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after load data command") {
+    val table = "change_stats_load_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i INT, j STRING) STORED AS PARQUET")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          withTempDir { loadPath =>
+            // load data command
+            val file = new File(loadPath + "/data")
+            val writer = new PrintWriter(file)
+            writer.write("2,xyz")
+            writer.close()
+            sql(s"LOAD DATA INPATH '${loadPath.toURI.toString}' INTO TABLE $table")
+            if (autoUpdate) {
+              val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+              assert(fetched2.get.sizeInBytes > 0)
+              assert(fetched2.get.colStats.isEmpty)
+              val statsProp = getStatsProperties(table)
+              assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+            } else {
+              assert(getStatsProperties(table).isEmpty)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after add/drop partition command") {
--- End diff --

we can't get table properties through hiveClient in the parent class.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124456568
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
--- End diff --

Hive updates stats for some formats. E.g., in the test `test statistics of LogicalRelation converted from Hive serde tables`, Hive updates stats for the ORC table but not for the Parquet table.
I think it's tricky and brittle to rely on Hive's stats; we should rely on Spark's stats. Here I just eliminate Hive's influence from the check. Since in `CatalogStatistics` we first fill in Hive's stats and then override them with Spark's stats, we can't tell where the stats came from if we only check `CatalogStatistics`.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124445546
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
+   */
+  private def getStatsProperties(tableName: String): Map[String, String] = {
+    val hTable = hiveClient.getTable(spark.sessionState.catalog.getCurrentDatabase, tableName)
+    hTable.properties.filterKeys(_.startsWith(STATISTICS_PREFIX))
+  }
+
+  test("change stats after insert command for hive table") {
+    val table = s"change_stats_insert_hive_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i int, j string)")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          // insert into command
+          sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
+          if (autoUpdate) {
+            val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+            assert(fetched2.get.sizeInBytes > 0)
+            assert(fetched2.get.colStats.isEmpty)
+            val statsProp = getStatsProperties(table)
+            assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+          } else {
+            assert(getStatsProperties(table).isEmpty)
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after load data command") {
+    val table = "change_stats_load_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
+        withTable(table) {
+          sql(s"CREATE TABLE $table (i INT, j STRING) STORED AS PARQUET")
+          // analyze to get initial stats
+          sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS i, j")
+          val fetched1 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = Some(0))
+          assert(fetched1.get.sizeInBytes == 0)
+          assert(fetched1.get.colStats.size == 2)
+
+          withTempDir { loadPath =>
+            // load data command
+            val file = new File(loadPath + "/data")
+            val writer = new PrintWriter(file)
+            writer.write("2,xyz")
+            writer.close()
+            sql(s"LOAD DATA INPATH '${loadPath.toURI.toString}' INTO TABLE $table")
+            if (autoUpdate) {
+              val fetched2 = checkTableStats(table, hasSizeInBytes = true, expectedRowCounts = None)
+              assert(fetched2.get.sizeInBytes > 0)
+              assert(fetched2.get.colStats.isEmpty)
+              val statsProp = getStatsProperties(table)
+              assert(statsProp(STATISTICS_TOTAL_SIZE).toLong == fetched2.get.sizeInBytes)
+            } else {
+              assert(getStatsProperties(table).isEmpty)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("change stats after add/drop partition command") {
--- End diff --

Can we put these tests in the parent class, so we don't need to duplicate the code for Hive and data source tables?
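
One possible factoring (a sketch; `checkStatsAfterInsert` is a hypothetical hook that each suite would override, since the Hive suite needs to check metastore properties while the data source suite checks catalog stats):

```scala
// In StatisticsCollectionTestBase (sketch):
protected def checkStatsAfterInsert(table: String, autoUpdate: Boolean): Unit

protected def testStatsAfterInsert(createTableSql: String => String): Unit = {
  val table = "change_stats_shared_table"
  Seq(false, true).foreach { autoUpdate =>
    withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
      withTable(table) {
        sql(createTableSql(table))
        // seed the stats, then change the data
        sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
        sql(s"INSERT INTO TABLE $table SELECT 1, 'abc'")
        checkStatsAfterInsert(table, autoUpdate)
      }
    }
  }
}
```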





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124445434
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -448,6 +433,145 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
       "ALTER TABLE unset_prop_table UNSET TBLPROPERTIES ('prop1')")
   }
 
+  /**
+   * To see if stats exist, we need to check spark's stats properties instead of catalog
+   * statistics, because hive would change stats in metastore and thus change catalog statistics.
--- End diff --

Do you mean we will have table stats anyway, since the Hive metastore stores them?





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124445210
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -168,6 +157,92 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with SharedSQLContext
       assert(stats.simpleString == expectedString)
     }
   }
+
+  test("change stats after truncate command") {
+    val table = "change_stats_truncate_table"
+    Seq(false, true).foreach { autoUpdate =>
+      withSQLConf(SQLConf.AUTO_UPDATE_SIZE.key -> autoUpdate.toString) {
--- End diff --

`TruncateTableCommand` doesn't read the `AUTO_UPDATE_SIZE` config at all, so we don't need to test against this config.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124445015
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
--- End diff --

Do we need this `if`? We already do the same check in `CommandUtils.updateTableStats`.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124433756
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

We already added a flag to enable/disable the auto update; the flag is checked inside `CommandUtils.updateTableStats` in order to reduce code redundancy.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124323569
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

> hive.stats.autogather
> Default Value: true
> Added In: Hive 0.7 with HIVE-1361
> A flag to gather statistics automatically during the INSERT OVERWRITE 
command.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124323435
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

This is not acceptable in a cloud environment.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r124323305
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

When the number of files is large, it is very slow.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123998237
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -774,6 +774,13 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
 
+  val AUTO_UPDATE_SIZE =
+    buildConf("spark.sql.statistics.autoUpdate.size")
+      .doc("Enables automatic update for table size once table's data is changed. " +
+        "Note that this update also removes other statistics (e.g. row count and column statistics)")
--- End diff --

This is misleading: even if users don't enable the auto update, row count and column statistics will still be removed.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-26 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123952350
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

Sure.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123944288
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

Then shall we update the config name? We need to mention that it only auto-updates the table size.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-25 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123937554
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

Since updating the size only needs to calculate file sizes, while updating the other stats needs to scan the data, the two costs are in different orders of magnitude. How about adding another flag to auto update the other stats in a separate PR or follow-up?
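
A sketch of that two-flag split (the second config name is hypothetical; it is not something this PR adds):

```scala
// Cheap: only re-lists files to refresh sizeInBytes.
val AUTO_UPDATE_SIZE =
  buildConf("spark.sql.statistics.autoUpdate.size")
    .doc("Auto update table size (file-listing cost only) after data-changing commands.")
    .booleanConf
    .createWithDefault(false)

// Expensive (possible follow-up): re-collect row count and column stats,
// which requires a full scan of the data.
val AUTO_UPDATE_ALL_STATS =
  buildConf("spark.sql.statistics.autoUpdate.allStats")
    .doc("Auto re-analyze all statistics after data-changing commands (scans the data).")
    .booleanConf
    .createWithDefault(false)
```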





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-25 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123935398
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

In my initial design, I planned to update the size by default and update the other stats based on a flag.
Auto updating all stats behind a single flag seems too aggressive to me...





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123930251
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

since we are protected by a flag, can we be more aggressive and auto update 
all stats?





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123929887
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -165,6 +167,22 @@ private[sql] trait SQLTestUtils
   }
 
+  /**
+   * Creates the specified number of temporary directories, which is then passed to `f` and will be
+   * deleted after `f` returns.
+   */
+  protected def withTempPaths(numPaths: Int)(f: Seq[File] => Unit): Unit = {
+    val files = mutable.Buffer[File]()
--- End diff --

nit: we can just create an array as we know the size.
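
A minimal sketch of that change, assuming `Utils.createTempDir` and `Utils.deleteRecursively` from `org.apache.spark.util.Utils`, which `SQLTestUtils` already uses:

```scala
protected def withTempPaths(numPaths: Int)(f: Seq[File] => Unit): Unit = {
  // Allocate the exact number of directories up front instead of a mutable buffer.
  val files = Array.fill[File](numPaths)(Utils.createTempDir().getCanonicalFile)
  try f(files) finally files.foreach(Utils.deleteRecursively)
}
```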





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123928827
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -0,0 +1,112 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql.execution.command
+
+import java.net.URI
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.fs.{FileSystem, Path}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.internal.SessionState
+
+
+object CommandUtils extends Logging {
+
+  /**
+   * Update statistics (currently only sizeInBytes) after changing data by commands.
+   */
+  def updateTableStats(
+      sparkSession: SparkSession,
+      table: CatalogTable,
+      newTableSize: Option[BigInt] = None,
+      newRowCount: Option[BigInt] = None): Unit = {
+    if (sparkSession.sessionState.conf.autoStatsUpdate && table.stats.nonEmpty) {
+      val catalog = sparkSession.sessionState.catalog
+      val newTable = catalog.getTableMetadata(table.identifier)
+      val newSize = newTableSize.getOrElse(
+        CommandUtils.calculateTotalSize(sparkSession.sessionState, newTable))
+      catalog.alterTableStats(table.identifier,
+        CatalogStatistics(sizeInBytes = newSize, rowCount = newRowCount))
--- End diff --

so we never auto update column stats?





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r123891656
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -773,6 +773,12 @@ object SQLConf {
       .doubleConf
       .createWithDefault(0.05)
 
+  val AUTO_STATS_UPDATE =
+    buildConf("spark.sql.statistics.autoUpdate")
+      .doc("Enables automatic statistics update once table's data is changed.")
+      .booleanConf
+      .createWithDefault(false)
--- End diff --

we can set it to true by default once we have incremental stats collection.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r122790775
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

It's difficult to get the size of `query` in the insert command.
Besides, I personally prefer a simple implementation here, because the overhead of scanning the files is usually negligible compared to the append operation itself.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r122600223
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -161,6 +161,11 @@ case class InsertIntoHadoopFsRelationCommand(
   fileIndex.foreach(_.refresh())
   // refresh data cache if table is cached
   sparkSession.catalog.refreshByPath(outputPath.toString)
+
+  if (catalogTable.nonEmpty) {
+CommandUtils.updateTableStats(sparkSession, catalogTable.get)
--- End diff --

For append mode, is it possible to just add the delta, instead of re-calculating the size of the whole table?
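
A rough sketch of what that could look like; `writtenBytes` is an assumed input that would have to come from the write path (e.g. commit-protocol metrics), which is exactly the hook this PR doesn't have:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}

// Incremental update for append: add the delta to the existing size instead
// of re-listing every file under the table location.
def addDeltaToTableSize(
    sparkSession: SparkSession,
    table: CatalogTable,
    writtenBytes: BigInt): Unit = {
  table.stats.foreach { oldStats =>
    sparkSession.sessionState.catalog.alterTableStats(
      table.identifier,
      CatalogStatistics(
        sizeInBytes = oldStats.sizeInBytes + writtenBytes,
        rowCount = None)) // row count is stale after an append without a rescan
  }
}
```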





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-17 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/18334

[SPARK-21127] [SQL] Update statistics after data changing commands

## What changes were proposed in this pull request?

Update stats after the following data changing commands:

- InsertIntoHadoopFsRelationCommand
- InsertIntoHiveTable
- LoadDataCommand
- TruncateTableCommand
- AlterTableSetLocationCommand
- AlterTableDropPartitionCommand
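
For example (a usage sketch; the config key below is the one proposed in review, and stats must already exist on the table for the update to kick in):

```scala
spark.conf.set("spark.sql.statistics.autoUpdate.size", "true")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")    // seed the stats once
spark.sql("INSERT INTO TABLE t SELECT * FROM src") // data-changing command
// sizeInBytes in the catalog is now refreshed without re-running ANALYZE
```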

## How was this patch tested?
Added new test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark changeStatsForOperation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18334.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18334


commit 9d4d97a272a74d99f79026648b0af72b4f5249ab
Author: Zhenhua Wang 
Date:   2017-06-17T07:31:35Z

update stats after commands



