GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14207
[SPARK-16552] [SQL] [WIP] Store the Inferred Schemas into External Catalog Tables when Creating Tables

#### What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of a schema falls into two groups. This applies to both Hive tables and Data Source tables:

**Group A. Users specify the schema.**

_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * FROM input
```

_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```

**Group B. Spark SQL infers the schema at runtime.**

_Case 3 CREATE TABLE_: users do not specify the schema, only the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```

Before this PR, Spark SQL did not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache, or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, runtime schema inference can cause undesirable schema changes after each restart of Spark.

This PR stores the inferred schema in the external catalog when creating the table. When users intend to refresh the schema, they issue `REFRESH TABLE`; Spark SQL then infers the schema again based on the previously specified table location and updates the schema in both the external catalog and the metadata cache. In this PR, we do not use the inferred schema to replace a user-specified schema, to avoid external behavior changes.
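The Group B workflow under this PR can be sketched as follows; the table name `events` and the path are hypothetical, chosen only for illustration:

```SQL
-- Create a table without an explicit schema; the schema is inferred from
-- the JSON files at the given path and, with this PR, persisted in the
-- external catalog at creation time.
CREATE TABLE events
USING org.apache.spark.sql.json
OPTIONS (path '/data/events')

-- Later, after files with additional columns land at the same path,
-- re-infer the schema and update the external catalog and metadata cache.
REFRESH TABLE events
```

Without the `REFRESH TABLE`, the table keeps the schema stored at creation time across Spark restarts, rather than silently re-inferring it.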
Based on this design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them yet.

#### How was this patch tested?

TODO: add more cases to cover the changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark userSpecifiedSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14207.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14207

----

commit 3c992a9eb39e3258776e52d0524b8bc46bc3ee08
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T07:08:43Z

    fix.

commit 5ed4e68283dd0ee0ad5deddc787eae8fe47f7574
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T07:10:37Z

    Merge remote-tracking branch 'upstream/master' into userSpecifiedSchema

commit 3be0dc0b7cfd942459c598c0d35f3d67a2c020ba
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T19:19:40Z

    fix.