[GitHub] spark pull request #14155: [SPARK-16498][SQL] move hive hack for data source...

cloud-fan Fri, 19 Aug 2016 21:50:18 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14155#discussion_r75572713
  
    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -200,22 +375,77 @@ private[spark] class HiveExternalCatalog(client: HiveClient, hadoopConf: Configu
        * Alter a table whose name that matches the one specified in `tableDefinition`,
        * assuming the table exists.
        *
    -   * Note: As of now, this only supports altering table properties, serde properties,
    -   * and num buckets!
    +   * Note: As of now, this doesn't support altering table schema, partition column names and bucket
    +   * specification. We will ignore them even if users do specify different values for these fields.
        */
       override def alterTable(tableDefinition: CatalogTable): Unit = withClient {
         assert(tableDefinition.identifier.database.isDefined)
         val db = tableDefinition.identifier.database.get
         requireTableExists(db, tableDefinition.identifier.table)
    -    client.alterTable(tableDefinition)
    +    verifyTableProperties(tableDefinition)
    +
    +    if (tableDefinition.provider == Some("hive") || tableDefinition.tableType == VIEW) {
    +      client.alterTable(tableDefinition)
    +    } else {
    +      val oldDef = client.getTable(db, tableDefinition.identifier.table)
    +      // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from the old table definition,
    +      // to retain the spark specific format if it is.
    +      // Also add table meta properties to table properties, to retain the data source table format.
    +      val newDef = tableDefinition.copy(
    +        schema = oldDef.schema,
    +        partitionColumnNames = oldDef.partitionColumnNames,
    +        bucketSpec = oldDef.bucketSpec,
    +        properties = tableMetadataToProperties(tableDefinition) ++ tableDefinition.properties)
    +
    +      client.alterTable(newDef)
    +    }
       }
     
       override def getTable(db: String, table: String): CatalogTable = withClient {
    -    client.getTable(db, table)
    +    restoreTableMetadata(client.getTable(db, table))
       }
     
       override def getTableOption(db: String, table: String): Option[CatalogTable] = withClient {
    -    client.getTableOption(db, table)
    +    client.getTableOption(db, table).map(restoreTableMetadata)
    +  }
    +
    +  /**
    +   * Restores table metadata from the table properties if it's a datasouce table. This method is
    +   * kind of a opposite version of [[createTable]].
    +   *
    +   * It reads table schema, provider, partition column names and bucket specification from table
    +   * properties, and filter out these special entries from table properties.
    +   */
    +  private def restoreTableMetadata(table: CatalogTable): CatalogTable = {
    +    if (table.tableType == VIEW) {
    +      table
    +    } else {
    +      getProviderFromTableProperties(table).map { provider =>
    +        // SPARK-15269: Persisted data source tables always store the location URI as a storage
    +        // property named "path" instead of standard Hive `dataLocation`, because Hive only
    +        // allows directory paths as location URIs while Spark SQL data source tables also
    +        // allows file paths. So the standard Hive `dataLocation` is meaningless for Spark SQL
    +        // data source tables.
    +        // Spark SQL may also save external data source in Hive compatible format when
    +        // possible, so that these tables can be directly accessed by Hive. For these tables,
    +        // `dataLocation` is still necessary. Here we also check for input format because only
    +        // these Hive compatible tables set this field.
    +        val storage = if (table.tableType == EXTERNAL && table.storage.inputFormat.isEmpty) {
    +          table.storage.copy(locationUri = None)
    +        } else {
    +          table.storage
    +        }
    +        table.copy(
    +          storage = storage,
    +          schema = getSchemaFromTableProperties(table),
    +          provider = Some(provider),
    +          partitionColumnNames = getPartitionColumnsFromTableProperties(table),
    +          bucketSpec = getBucketSpecFromTableProperties(table),
    +          properties = getOriginalTableProperties(table))
    --- End diff --
    
    The previous code also stored options in the serde properties. I'm not
    going to fix everything in this PR, and I'm not sure whether it's a real
    problem, so let's continue the discussion in a follow-up.
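
    For anyone following the thread, the round trip that `alterTable` and
    `restoreTableMetadata` rely on can be sketched roughly as below. This is
    a simplified illustration, not the code in this PR: `ToyTable`,
    `MetaPrefix`, `hide` and `restore` are hypothetical names, and the
    property keys are assumptions made for the example.

```scala
// Toy model of the property round trip. Every name here (ToyTable,
// MetaPrefix, hide, restore) is hypothetical and not part of Spark's API.
final case class ToyTable(
    provider: Option[String],
    schema: Seq[String],
    properties: Map[String, String])

object PropertyRoundTrip {
  // Assumed prefix marking the special entries; the real keys may differ.
  val MetaPrefix = "toy.meta."

  // Write path: fold provider and schema into prefixed table properties,
  // so the metastore only ever sees plain string properties.
  def hide(t: ToyTable): ToyTable = t.provider match {
    case None => t // no provider recorded: store the table unchanged
    case Some(p) =>
      val cols = t.schema.zipWithIndex.map {
        case (c, i) => (MetaPrefix + s"col.$i") -> c
      }
      val meta = Map(
        (MetaPrefix + "provider") -> p,
        (MetaPrefix + "numCols") -> t.schema.size.toString) ++ cols
      // Same ordering as in the diff: metadata first, user properties after.
      ToyTable(None, Seq.empty, meta ++ t.properties)
  }

  // Read path: rebuild provider and schema from the special entries and
  // filter those entries out of the user-visible properties.
  def restore(t: ToyTable): ToyTable =
    t.properties.get(MetaPrefix + "provider") match {
      case None => t // no special entries: return the table untouched
      case Some(p) =>
        val n = t.properties(MetaPrefix + "numCols").toInt
        val cols = (0 until n).map(i => t.properties(MetaPrefix + s"col.$i"))
        val userProps =
          t.properties.filter { case (k, _) => !k.startsWith(MetaPrefix) }
        ToyTable(Some(p), cols, userProps)
    }
}
```

    In this sketch, `restore(hide(t))` gives back the original table for any
    `t` whose own properties do not already use the assumed prefix.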



