[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...

gatorsmile Tue, 30 Aug 2016 21:40:13 -0700

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14712#discussion_r76923549
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala ---
    @@ -21,25 +21,55 @@ import scala.util.control.NonFatal
     
     import org.apache.hadoop.fs.{FileSystem, Path}
     
    -import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
    +import org.apache.spark.sql.{AnalysisException, Dataset, Row, SparkSession}
     import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
     import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable}
    +import org.apache.spark.sql.catalyst.plans.logical.Statistics
    +import org.apache.spark.sql.execution.datasources.LogicalRelation
     
     
     /**
      * Analyzes the given table in the current database to generate statistics, which will be
      * used in query optimizations.
    - *
    - * Right now, it only supports Hive tables and it only updates the size of a Hive table
    - * in the Hive metastore.
      */
    -case class AnalyzeTableCommand(tableName: String) extends RunnableCommand {
    +case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand {
     
       override def run(sparkSession: SparkSession): Seq[Row] = {
         val sessionState = sparkSession.sessionState
         val tableIdent = sessionState.sqlParser.parseTableIdentifier(tableName)
         val relation = EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
     
    +    def updateTableStats(
    +        catalogTable: CatalogTable,
    +        oldTotalSize: Long,
    +        oldRowCount: Long,
    +        newTotalSize: Long): Unit = {
    +
    +      var newStats: Option[Statistics] = None
    +      if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
    +        newStats = Some(Statistics(sizeInBytes = newTotalSize))
    +      }
    +      if (!noscan) {
    +        val newRowCount = Dataset.ofRows(sparkSession, relation).count()
    +        if (newRowCount >= 0 && newRowCount != oldRowCount) {
    +          newStats = if (newStats.isDefined) {
    +            newStats.map(_.copy(rowCount = Some(BigInt(newRowCount))))
    +          } else {
    +            Some(Statistics(sizeInBytes = oldTotalSize, rowCount = Some(BigInt(newRowCount))))
    +          }
    +        }
    +      }
    +      // Update the metastore if the above statistics of the table are different from those
    +      // recorded in the metastore.
    +      if (newStats.isDefined) {
    +        sessionState.catalog.alterTable(
    +          catalogTable.copy(catalogStats = newStats), fromAnalyze = true)
    +
    +        // Refresh the cache of the table in the catalog.
    --- End diff --
    
    This comment is confusing. We have two caches: one is the data cache, the other is the logical plan cache for data source tables, so it is unclear which one "the cache" refers to here.
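
For readers following the diff, the decision of whether the metastore is updated (which is what precedes the refresh discussed above) can be restated as a pure function. This is only a sketch of the logic quoted in the diff, not the code in the pull request: `Stats` is a hypothetical stand-in for Spark's `Statistics`, `StatsUpdate.newStats` is an invented name, and the row count is passed in as an `Option` (`None` standing for the `noscan` case) instead of being computed from the relation.

```scala
// Hypothetical stand-in for Spark's Statistics; only the two fields used in the diff.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object StatsUpdate {
  /**
   * Returns Some(stats) when the metastore would be updated, None otherwise.
   * newRowCount is None in the noscan case, where no count is computed.
   */
  def newStats(
      oldTotalSize: Long,
      oldRowCount: Long,
      newTotalSize: Long,
      newRowCount: Option[Long]): Option[Stats] = {
    // A positive size that differs from the old one yields new size-only statistics.
    val sizeStats: Option[Stats] =
      if (newTotalSize > 0 && newTotalSize != oldTotalSize) Some(Stats(BigInt(newTotalSize)))
      else None

    // A changed row count is attached to the new size if there is one,
    // otherwise to statistics built from the old size.
    newRowCount.filter(c => c >= 0 && c != oldRowCount) match {
      case Some(c) =>
        val base = sizeStats.getOrElse(Stats(BigInt(oldTotalSize)))
        Some(base.copy(rowCount = Some(BigInt(c))))
      case None => sizeStats
    }
  }
}
```

A result of `None` corresponds to the branch in the diff where `alterTable` is never called, so no refresh of either cache would be triggered.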



