[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19003



[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-22 Thread gatorsmile
Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/19003



[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-22 Thread gatorsmile
GitHub user gatorsmile reopened a pull request:

https://github.com/apache/spark/pull/19003

[SPARK-21769] [SQL] Add a table-specific option for always respecting 
schemas inferred/controlled by Spark SQL

## What changes were proposed in this pull request?
For Hive-serde tables, we always respect the schema stored in the Hive
metastore, because that schema could have been altered by other engines that
share the same metastore. Thus, whenever the two schemas differ (ignoring
nullability and case), we trust the metastore-controlled schema for Hive-serde
tables. However, in some scenarios the Hive metastore can INCORRECTLY
overwrite the schema, for example when the table's serde and the Hive
metastore's built-in serde are different.

The proposed solution is to introduce a table-specific option for such
scenarios. With this option set, users can make Spark always respect the
Spark-inferred/controlled schema for that table instead of trusting the
metastore-controlled schema. By default, we still trust the Hive
metastore-controlled schema.
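
A minimal sketch of how the option is meant to be used, assuming a SparkSession
named `spark`: the `respectSparkSchema` serde property and the Avro serde /
input format / output format classes are taken from the cross-version test case
quoted later in this thread, while the table name and column are illustrative.

    // Create a Hive Avro table and ask Spark to keep its own inferred/controlled
    // schema for this table instead of the schema recorded in the Hive metastore.
    spark.sql(
      """
        |CREATE TABLE decimals_avro (f0 DECIMAL(38, 2))
        |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
        |STORED AS
        |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      """.stripMargin)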

## How was this patch tested?
Added a cross-version test case

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark respectSparkSchema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19003.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19003


commit 4c7349f5d7cef703e11d93e114c8361a940e8bfa
Author: gatorsmile 
Date:   2017-08-20T03:17:12Z

fix.

commit 36339c809a086fb1bb94ec167bf2fa9e4169aca1
Author: gatorsmile 
Date:   2017-08-22T18:22:26Z

fix.





[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-22 Thread sameeragarwal
Github user sameeragarwal commented on a diff in the pull request:

https://github.com/apache/spark/pull/19003#discussion_r134557690
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SourceOptions.scala ---
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+/**
+ * Options for the Parquet data source.
--- End diff --

nit: update docs
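
For context, a rough sketch of the kind of options wrapper this new file
appears to introduce. Only the license header, package, import, and the stale
doc comment are visible in the quoted diff; the class body below is guessed
from the `CaseInsensitiveMap` import and the `respectSparkSchema` property used
in the test, so treat the member names and defaults as assumptions rather than
the actual contents of the file.

    package org.apache.spark.sql.execution.datasources

    import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

    /**
     * Per-table options used when reading/writing data source and Hive-serde
     * tables (sketch only; the doc comment in the real file still says
     * "Parquet", hence the nit above).
     */
    class SourceOptions(@transient private val parameters: CaseInsensitiveMap[String])
      extends Serializable {

      def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters))

      // Assumed option: when true, keep the Spark-inferred/controlled schema
      // instead of the schema recorded in the Hive metastore.
      val respectSparkSchema: Boolean =
        parameters.get("respectSparkSchema").map(_.toBoolean).getOrElse(false)
    }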



[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19003#discussion_r134392097
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -763,6 +763,47 @@ class VersionsSuite extends SparkFunSuite with Logging {
       }
     }
 
+    test(s"$version: read avro file containing decimal") {
+      val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+      val location = new File(url.getFile)
+
+      val tableName = "tab1"
+      val avroSchema =
+        """{
+          |  "name": "test_record",
+          |  "type": "record",
+          |  "fields": [ {
+          |    "name": "f0",
+          |    "type": [
+          |      "null",
+          |      {
+          |        "precision": 38,
+          |        "scale": 2,
+          |        "type": "bytes",
+          |        "logicalType": "decimal"
+          |      }
+          |    ]
+          |  } ]
+          |}
+        """.stripMargin
+      withTable(tableName) {
+        versionSpark.sql(
+          s"""
+             |CREATE TABLE $tableName
+             |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+             |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+             |STORED AS
+             |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+             |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+             |LOCATION '$location'
+             |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')
--- End diff --

There was an argument about whether we should add `TBLPROPERTIES` support to 
`CREATE TABLE ... USING`, and we decided not to add it. I'm totally fine with 
adding it if it's necessary.



[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-20 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19003#discussion_r134129199
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala ---
@@ -763,6 +763,47 @@ class VersionsSuite extends SparkFunSuite with Logging {
       }
     }
 
+    test(s"$version: read avro file containing decimal") {
+      val url = Thread.currentThread().getContextClassLoader.getResource("avroDecimal")
+      val location = new File(url.getFile)
+
+      val tableName = "tab1"
+      val avroSchema =
+        """{
+          |  "name": "test_record",
+          |  "type": "record",
+          |  "fields": [ {
+          |    "name": "f0",
+          |    "type": [
+          |      "null",
+          |      {
+          |        "precision": 38,
+          |        "scale": 2,
+          |        "type": "bytes",
+          |        "logicalType": "decimal"
+          |      }
+          |    ]
+          |  } ]
+          |}
+        """.stripMargin
+      withTable(tableName) {
+        versionSpark.sql(
+          s"""
+             |CREATE TABLE $tableName
+             |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
+             |WITH SERDEPROPERTIES ('respectSparkSchema' = 'true')
+             |STORED AS
+             |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+             |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+             |LOCATION '$location'
+             |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')
--- End diff --

For an example like this one, which requires users to set `TBLPROPERTIES`, it 
sounds like we are unable to use the `CREATE TABLE ... USING` command. cc @cloud-fan 
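
A minimal sketch of the contrast being raised here, assuming a SparkSession
named `spark`; the data source form, table name, and path below are
illustrative. Per the discussion in this thread, the `USING` syntax of that era
only took per-table `OPTIONS`, so a table that needs `avro.schema.literal` in
`TBLPROPERTIES` has to be created with the Hive syntax used in the test above.

    // Data source syntax (illustrative): per-table options go through OPTIONS,
    // and there is no TBLPROPERTIES clause in this form, so a property such as
    // 'avro.schema.literal' has nowhere to go.
    spark.sql(
      """
        |CREATE TABLE decimals_ds (f0 DECIMAL(38, 2))
        |USING parquet
        |OPTIONS (path '/tmp/decimals_ds')
      """.stripMargin)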



[GitHub] spark pull request #19003: [SPARK-21769] [SQL] Add a table-specific option f...

2017-08-19 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/19003

[SPARK-21769] [SQL] Add a table-specific option for always respecting 
schemas inferred/controlled by Spark SQL

## What changes were proposed in this pull request?
For Hive-serde tables, we always respect the schema stored in the Hive
metastore, because that schema could have been altered by other engines that
share the same metastore. Thus, whenever the two schemas differ (ignoring
nullability and case), we trust the metastore-controlled schema for Hive-serde
tables. However, in some scenarios the Hive metastore can INCORRECTLY
overwrite the schema, for example when the table's serde and the Hive
metastore's built-in serde are different.

The proposed solution is to introduce a table-specific option for such
scenarios. With this option set, users can make Spark always respect the
Spark-inferred/controlled schema for that table instead of trusting the
metastore-controlled schema. By default, we still trust the Hive
metastore-controlled schema.
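
A minimal sketch of how one might check which schema wins, assuming a
SparkSession named `spark` and a table created as in the example and test
quoted earlier in this digest (the table name is illustrative):

    // With 'respectSparkSchema' = 'true', the schema Spark reports should be the
    // Spark-inferred/controlled one (e.g. decimal(38,2) for the Avro decimal
    // column in the test), not whatever the metastore recorded for the
    // mismatched serde.
    spark.table("decimals_avro").printSchema()
    spark.sql("DESCRIBE TABLE decimals_avro").show(truncate = false)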

## How was this patch tested?
Added a cross-version test case

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark respectSparkSchema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19003.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19003


commit 4c7349f5d7cef703e11d93e114c8361a940e8bfa
Author: gatorsmile 
Date:   2017-08-20T03:17:12Z

fix.



