GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/14207
[SPARK-16552] [SQL] [WIP] Store the Inferred Schemas into External Catalog Tables when Creating Tables

#### What changes were proposed in this pull request?

Currently, in Spark SQL, the initial creation of a schema falls into two groups. This applies to both Hive tables and Data Source tables:

**Group A. Users specify the schema.**

_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * FROM input
```

_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```

**Group B. Spark SQL infers the schema at runtime.**

_Case 3 CREATE TABLE_: users do not specify the schema, only the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```

Before this PR, Spark SQL did not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache, or access the table for the first time after (re-)starting Spark, Spark SQL infers the schema and stores it in the metadata cache to improve the performance of subsequent metadata requests. However, runtime schema inference can cause undesirable schema changes after each restart of Spark.

This PR stores the inferred schema in the external catalog when creating the table. When users intend to refresh the schema, they issue `REFRESH TABLE`; Spark SQL then infers the schema again based on the previously specified table location and updates the schema in both the external catalog and the metadata cache. In this PR, we do not use the inferred schema to replace a user-specified schema, to avoid external behavior changes.
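The Group B workflow under this PR can be sketched as follows; the table name `events` and the path are hypothetical, chosen only for illustration:

```SQL
-- Create a table without an explicit schema; the schema is inferred from
-- the JSON files at the given path and, with this PR, persisted in the
-- external catalog at creation time.
CREATE TABLE events
USING org.apache.spark.sql.json
OPTIONS (path '/data/events')

-- Later, after files with additional columns land at the same path,
-- re-infer the schema and update the external catalog and metadata cache.
REFRESH TABLE events
```

Without the `REFRESH TABLE`, the table keeps the schema stored at creation time across Spark restarts, rather than silently re-inferring it.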
Based on this design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them yet.

#### How was this patch tested?

TODO: add more cases to cover the changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark userSpecifiedSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14207.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14207

----

commit 3c992a9eb39e3258776e52d0524b8bc46bc3ee08
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T07:08:43Z

    fix.

commit 5ed4e68283dd0ee0ad5deddc787eae8fe47f7574
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T07:10:37Z

    Merge remote-tracking branch 'upstream/master' into userSpecifiedSchema

commit 3be0dc0b7cfd942459c598c0d35f3d67a2c020ba
Author: gatorsmile <gatorsm...@gmail.com>
Date: 2016-07-14T19:19:40Z

    fix.