[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19003

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/19003
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
GitHub user gatorsmile reopened a pull request: https://github.com/apache/spark/pull/19003

[SPARK-21769] [SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL

## What changes were proposed in this pull request?

For Hive-serde tables, we always respect the schema stored in the Hive metastore, because the schema could be altered by other engines that share the same metastore. Thus, when the schemas differ (ignoring nullability and case), we always trust the metastore-controlled schema for Hive-serde tables. However, in some scenarios the Hive metastore can also INCORRECTLY overwrite the schema, e.g., when the table's serde and the Hive metastore's built-in serde are different.

The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect the Spark-inferred/controlled schema instead of trusting the metastore-controlled schema. By default, we trust the Hive metastore-controlled schema.

## How was this patch tested?

Added a cross-version test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark respectSparkSchema

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19003.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19003

commit 4c7349f5d7cef703e11d93e114c8361a940e8bfa
Author: gatorsmile
Date: 2017-08-20T03:17:12Z

    fix.

commit 36339c809a086fb1bb94ec167bf2fa9e4169aca1
Author: gatorsmile
Date: 2017-08-22T18:22:26Z

    fix.
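A hedged sketch of how a user would opt a table into the new behavior, adapted from the cross-version test added in this patch (`spark` is an assumed `SparkSession`; the table name and location are illustrative placeholders, not part of the patch):

```scala
// Illustrative sketch adapted from the VersionsSuite test in this PR.
// The table name and LOCATION path are placeholders; the key piece is the
// 'respectSparkSchema' serde property, which makes Spark always trust its
// own inferred/controlled schema for this table instead of the (possibly
// incorrectly rewritten) Hive metastore-controlled schema.
spark.sql(
  s"""
     |CREATE TABLE decimal_tab
     |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
     |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
     |STORED AS
     |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
     |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
     |LOCATION '/path/to/avroDecimal'
   """.stripMargin)
```

Without the serde property, Spark keeps its default behavior of trusting the metastore-controlled schema, which (per the description above) the metastore may have overwritten incorrectly for a non-built-in serde.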
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/19003#discussion_r134557690

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SourceOptions.scala ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+/**
+ * Options for the Parquet data source.
--- End diff --

nit: update docs
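For context on the fragment above: `SourceOptions` wraps a `CaseInsensitiveMap`, following the common Spark options-class pattern. A hedged sketch of what such a class typically looks like — everything beyond the package, import, and the `respectSparkSchema` option named in this PR is an assumption, not the actual file contents:

```scala
package org.apache.spark.sql.execution.datasources

import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Illustrative sketch only: mirrors the usual Spark options-class shape.
// The real file's contents beyond the diff fragment above are not shown here.
class SourceOptions(
    @transient private val parameters: CaseInsensitiveMap[String])
  extends Serializable {

  // Convenience constructor: wrap a plain map so option lookups
  // ignore the case of keys like 'respectSparkSchema'.
  def this(parameters: Map[String, String]) =
    this(CaseInsensitiveMap(parameters))

  // Whether Spark should always respect its own inferred/controlled schema
  // for this table instead of the Hive metastore-controlled one.
  // Defaults to false, i.e. trust the metastore.
  val respectSparkSchema: Boolean =
    parameters.getOrElse("respectSparkSchema", "false").toBoolean
}
```

The `CaseInsensitiveMap` wrapper is the design point: table options arrive from DDL, where users may write `respectSparkSchema`, `RESPECTSPARKSCHEMA`, etc., and all spellings should resolve to the same option.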
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19003#discussion_r134392097

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -763,6 +763,47 @@ class VersionsSuite extends SparkFunSuite with Logging {
     }
   }
 
+  test(s"$version: read avro file containing decimal") {
+    val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+    val location = new File(url.getFile)
+
+    val tableName = "tab1"
+    val avroSchema =
+      """{
+        |  "name": "test_record",
+        |  "type": "record",
+        |  "fields": [ {
+        |    "name": "f0",
+        |    "type": [
+        |      "null",
+        |      {
+        |        "precision": 38,
+        |        "scale": 2,
+        |        "type": "bytes",
+        |        "logicalType": "decimal"
+        |      }
+        |    ]
+        |  } ]
+        |}
+      """.stripMargin
+    withTable(tableName) {
+      versionSpark.sql(
+        s"""
+           |CREATE TABLE $tableName
+           |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+           |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+           |STORED AS
+           |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+           |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+           |LOCATION '$location'
+           |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')
--- End diff --

There was an argument about whether we should add `TBLPROPERTIES`, and we decided not to add it. I'm totally fine with adding it if it's necessary.
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19003#discussion_r134129199

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -763,6 +763,47 @@ class VersionsSuite extends SparkFunSuite with Logging {
     }
   }
 
+  test(s"$version: read avro file containing decimal") {
+    val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+    val location = new File(url.getFile)
+
+    val tableName = "tab1"
+    val avroSchema =
+      """{
+        |  "name": "test_record",
+        |  "type": "record",
+        |  "fields": [ {
+        |    "name": "f0",
+        |    "type": [
+        |      "null",
+        |      {
+        |        "precision": 38,
+        |        "scale": 2,
+        |        "type": "bytes",
+        |        "logicalType": "decimal"
+        |      }
+        |    ]
+        |  } ]
+        |}
+      """.stripMargin
+    withTable(tableName) {
+      versionSpark.sql(
+        s"""
+           |CREATE TABLE $tableName
+           |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+           |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+           |STORED AS
+           |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+           |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+           |LOCATION '$location'
+           |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')
--- End diff --

For an example like this, which requires users to set `TBLPROPERTIES`, it sounds like we are unable to use the `CREATE TABLE USING` command. cc @cloud-fan
[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/19003

[SPARK-21769] [SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL

## What changes were proposed in this pull request?

For Hive-serde tables, we always respect the schema stored in the Hive metastore, because the schema could be altered by other engines that share the same metastore. Thus, when the schemas differ (ignoring nullability and case), we always trust the metastore-controlled schema for Hive-serde tables. However, in some scenarios the Hive metastore can also INCORRECTLY overwrite the schema, e.g., when the table's serde and the Hive metastore's built-in serde are different.

The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect the Spark-inferred/controlled schema instead of trusting the metastore-controlled schema. By default, we trust the Hive metastore-controlled schema.

## How was this patch tested?

Added a cross-version test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark respectSparkSchema

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19003.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19003

commit 4c7349f5d7cef703e11d93e114c8361a940e8bfa
Author: gatorsmile
Date: 2017-08-20T03:17:12Z

    fix.