GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/14123

    [SPARK-16471] [SQL] Remove Hive-specific CreateHiveTableAsSelectLogicalPlan [WIP]

    #### What changes were proposed in this pull request?
    `CreateHiveTableAsSelectLogicalPlan` is a Hive-specific logical node. This is not a good design. We need to consolidate it into the corresponding `CreateTableUsingAsSelect`.
    
    The first step is to make the signature of `CreateTableUsingAsSelect` more general by using `CatalogTable` as the input for table metadata. The logical node will be renamed to `CreateTableAsSelect`. The new interface will look like:
    ```Scala
    case class CreateTableAsSelect(
        tableDesc: CatalogTable,
        provider: String,
        mode: SaveMode,
        child: LogicalPlan) extends logical.UnaryNode 
    ```
    The second step is to convert `CreateHiveTableAsSelectLogicalPlan` into `CreateTableAsSelect`.
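    
    A minimal sketch of that conversion, using simplified stand-in types instead of Spark's real `CatalogTable`, `SaveMode`, and `LogicalPlan` classes so it compiles on its own; the `provider = "hive"` value and the `allowExisting` -> `SaveMode` mapping are illustrative assumptions, not the PR's final code:
    ```Scala
    object CtasConsolidationSketch {
      // Simplified stand-ins for the corresponding Catalyst classes.
      case class CatalogTable(identifier: String, properties: Map[String, String] = Map.empty)
      sealed trait SaveMode
      case object Ignore extends SaveMode
      case object ErrorIfExists extends SaveMode
      trait LogicalPlan

      // The Hive-specific node being removed and the proposed unified node.
      case class CreateHiveTableAsSelectLogicalPlan(
          tableDesc: CatalogTable, child: LogicalPlan, allowExisting: Boolean) extends LogicalPlan
      case class CreateTableAsSelect(
          tableDesc: CatalogTable, provider: String, mode: SaveMode, child: LogicalPlan) extends LogicalPlan

      // Rewrite the Hive-specific node into the unified node.
      def consolidate(plan: LogicalPlan): LogicalPlan = plan match {
        case CreateHiveTableAsSelectLogicalPlan(desc, child, allowExisting) =>
          CreateTableAsSelect(desc, provider = "hive",
            mode = if (allowExisting) Ignore else ErrorIfExists, child = child)
        case other => other
      }
    }
    ```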
    
    This PR is based on a comparison of the two interfaces. The details are described below.
    
    Currently, the SQL interface is the only entry point to `CreateHiveTableAsSelectLogicalPlan`. Below is the correspondence between the SQL interface and `CreateHiveTableAsSelectLogicalPlan`:
    ```Scala
    case class CreateHiveTableAsSelectLogicalPlan(
        tableDesc: CatalogTable,
        child: LogicalPlan,
        allowExisting: Boolean)
        extends UnaryNode with Command 
    ```
    ```Scala
    SQL:
    
    When conf.convertCTAS == false || either [ROW FORMAT row_format] or [STORED AS file_format] is specified
    
      CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      [(col1[:] data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
      [ROW FORMAT row_format]
      [STORED AS file_format]
      [LOCATION path]
      [TBLPROPERTIES (property_name=property_value, ...)]
      [AS select_statement];
      
      -->
      
      [TEMPORARY] is not allowed.
    
      allowExisting: Boolean = [IF NOT EXISTS]
      child: LogicalPlan = select_statement
      tableDesc: CatalogTable = CatalogTable(
        identifier = [db_name.]table_name,
        tableType = [EXTERNAL],
        storage = [ROW FORMAT row_format] +
                  [STORED AS file_format] +
                  [LOCATION path],
        schema = Seq.empty,
        partitionColumnNames = Seq.empty,
        properties = [TBLPROPERTIES (property_name=property_value, ...)],
        comment = [COMMENT table_comment])
    ```
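    
    As a concrete illustration of the mapping above (a hypothetical example; the database, table, and column names are made up), a statement like the following, with `conf.convertCTAS` left at `false` or a `STORED AS` clause present, ends up as `CreateHiveTableAsSelectLogicalPlan`:
    ```Scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ctas-example").enableHiveSupport().getOrCreate()

    spark.sql(
      """CREATE TABLE IF NOT EXISTS db1.t1
        |COMMENT 'example table'
        |STORED AS parquet
        |AS SELECT id, name FROM db1.src""".stripMargin)
    // --> allowExisting = true                (from IF NOT EXISTS)
    //     child         = plan of the SELECT  (from AS select_statement)
    //     tableDesc     = CatalogTable(identifier = db1.t1,
    //                                  storage   = STORED AS parquet,
    //                                  comment   = 'example table', ...)
    ```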
    
    `CreateTableUsingAsSelect` has three entry points. Below is the correspondence:
    ```Scala
    case class CreateTableUsingAsSelect(
        tableIdent: TableIdentifier,
        provider: String,
        partitionColumns: Array[String],
        bucketSpec: Option[BucketSpec],
        mode: SaveMode,
        options: Map[String, String],
        child: LogicalPlan) extends logical.UnaryNode 
    ```
    ```Scala
    SQL Interface I:
    
    When conf.convertCTAS == true && neither [ROW FORMAT row_format] nor [STORED AS file_format] is specified
    
      CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      [(col1[:] data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
      [ROW FORMAT row_format]
      [STORED AS file_format]
      [LOCATION path]
      [TBLPROPERTIES (property_name=property_value, ...)]
      [AS select_statement];
      
      --> 
      
      tableIdent: TableIdentifier = [db_name.]table_name,
      provider: String = conf.defaultDataSourceName,
      partitionColumns: Array[String] = Seq.empty,
      bucketSpec: Option[BucketSpec] = None,
      mode: SaveMode = [IF NOT EXISTS],
      options: Map[String, String] = [LOCATION path],
      child: LogicalPlan = [AS select_statement]
    ```
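    
    A hypothetical counterpart for SQL Interface I: with `conf.convertCTAS` enabled (the `spark.sql.hive.convertCTAS` key below is my assumption of the flag's name) and neither `ROW FORMAT` nor `STORED AS` specified, a plain CTAS is planned as `CreateTableUsingAsSelect` backed by `conf.defaultDataSourceName`:
    ```Scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.conf.set("spark.sql.hive.convertCTAS", "true")

    spark.sql(
      """CREATE TABLE IF NOT EXISTS db1.t2
        |AS SELECT id, name FROM db1.src""".stripMargin)
    // --> provider = conf.defaultDataSourceName (e.g. "parquet")
    //     mode     = SaveMode.Ignore            (from IF NOT EXISTS)
    //     child    = plan of the SELECT
    ```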
    ```Scala
    SQL Interface II:
    
      CREATE [EXTERNAL] [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
      [(col1[:] data_type [COMMENT col_comment], ...)]
      USING qualifiedName
      [OPTIONS tablePropertyList]
      [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col3, ...) (SORTED BY orderedIdentifierList)? INTO INTEGER_VALUE BUCKETS]
      [AS select_statement];
    
      -->
    
      [EXTERNAL] is not allowed.
      [TEMPORARY] is not allowed.
    
      tableIdent: TableIdentifier = [db_name.]table_name,
      provider: String = USING qualifiedName,
      partitionColumns: Array[String] = [PARTITIONED BY (col2[:] data_type [COMMENT col_comment], ...)],
      bucketSpec: Option[BucketSpec] = [CLUSTERED BY (col3, ...) (SORTED BY orderedIdentifierList)? INTO INTEGER_VALUE BUCKETS],
      mode: SaveMode = [IF NOT EXISTS],
      options: Map[String, String] = [OPTIONS tablePropertyList],
      child: LogicalPlan = [AS select_statement]
    ```
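    
    A hypothetical example for SQL Interface II (the data-source CTAS path); the table name, path, and columns are invented for illustration:
    ```Scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    spark.sql(
      """CREATE TABLE db1.t3
        |USING parquet
        |OPTIONS (path '/tmp/t3')
        |PARTITIONED BY (dt)
        |CLUSTERED BY (id) SORTED BY (name) INTO 8 BUCKETS
        |AS SELECT id, name, dt FROM db1.src""".stripMargin)
    // --> provider         = "parquet"
    //     partitionColumns = Array("dt")
    //     bucketSpec       = Some(BucketSpec(8, Seq("id"), Seq("name")))
    //     options          = Map("path" -> "/tmp/t3")
    //     mode             = SaveMode.ErrorIfExists  (no IF NOT EXISTS)
    ```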
    ```Scala
    DataFrameWriter Interface:
    
      tableIdent: TableIdentifier = tableIdent (from saveAsTable API),
      provider: String = source (from format API),
      partitionColumns: Array[String] = partitioningColumns (from partitionBy API),
      bucketSpec: Option[BucketSpec] = getBucketSpec function (from bucketBy API and sortBy API),
      mode: SaveMode = mode (from mode API),
      options: Map[String, String] = extraOptions (from option and options API),
      child: LogicalPlan = df.logicalPlan (from DataFrameWriter)
    ```
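    
    For reference, a hypothetical `DataFrameWriter` equivalent, again with made-up table and column names; each builder call below feeds one field of `CreateTableUsingAsSelect` as listed above:
    ```Scala
    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.table("db1.src")          // child = df.logicalPlan

    df.write
      .format("parquet")                     // provider
      .partitionBy("dt")                     // partitionColumns
      .bucketBy(8, "id")                     // bucketSpec: numBuckets + bucket columns
      .sortBy("name")                        //             + sort columns
      .mode(SaveMode.Ignore)                 // mode
      .option("path", "/tmp/t4")             // options
      .saveAsTable("db1.t4")                 // tableIdent
    ```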
    
    #### How was this patch tested?
    The existing test cases cover this code refactoring.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark removeHiveCTASLogicalNode

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14123
    
----
commit 55b1a8c4f44611b2f7372acef5b79dad5833d105
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-07-09T07:32:57Z

    fix.

commit 568f13352a30412c74b267bad1b339c17653f02c
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-07-10T07:09:30Z

    fix1

commit 5bef1e8353d41c7fa333bba78502934211692a15
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-07-10T07:20:48Z

    revert

----

