[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

2017-11-02 Thread dongjoon-hyun
Github user dongjoon-hyun closed the pull request at:

https://github.com/apache/spark/pull/19552


---




[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

2017-10-23 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/19552#discussion_r146416338
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -388,7 +388,7 @@ object SQLConf {
 .stringConf
 .transform(_.toUpperCase(Locale.ROOT))
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+.createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
--- End diff --

```INFER_AND_SAVE``` was introduced to fix the issue described in 
[SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611): queries broke 
for any table lacking the Spark-embedded table schema property. That affected 
every table not created with Spark 2.0 or above, including tables created by 
older versions of Spark SQL (the situation we ran into).

Some issues with how this affects other Hive table properties were 
uncovered in [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306). 
These problems are resolved by falling back to ```NEVER_INFER```, the default 
behavior prior to Spark 2.2.0. This means that, out of the box, Spark still 
won't be compatible with Hive tables backed by case-sensitive data files that 
weren't created by Spark SQL 2.0 or above, but it will avoid mangling existing 
Hive table properties. This is meant as a short-term fix until I can go back 
and debug/resolve the conflicts that are occurring.
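
For anyone who still needs the inference behavior, it remains available as an 
explicit opt-in. A minimal sketch, assuming a Hive-enabled session; only the 
config key and values come from the `SQLConf` entry above:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: opt back in to schema inference per application.
// INFER_AND_SAVE infers the case-sensitive schema from the data files
// and writes it back to the Hive Metastore as a table property.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .getOrCreate()
```

An `INFER_ONLY` mode also exists for users who want the inferred schema at 
read time without writing anything back to the metastore.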

I think these issues have highlighted how brittle an approach relying on 
Spark-specific Hive table properties is, especially since it's impossible to 
predict how other frameworks will use the table properties themselves. Still, 
I don't think there's a better way of doing this, and we may just have to deal 
with conflicts like this as they arise.


---




[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

2017-10-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19552#discussion_r146385329
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -388,7 +388,7 @@ object SQLConf {
 .stringConf
 .transform(_.toUpperCase(Locale.ROOT))
 .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-.createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+.createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
--- End diff --

We can improve the documentation instead of changing the default. 

If my understanding is right, this occurs only when Spark SQL tries to read 
tables created by other systems.  


---




[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...

2017-10-22 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19552

[SPARK-22329][SQL] Use NEVER_INFER for 
`spark.sql.hive.caseSensitiveInferenceMode` by default

## What changes were proposed in this pull request?

In Spark 2.2.0, `spark.sql.hive.caseSensitiveInferenceMode` has a critical 
issue by default. 

- [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) made 
`INFER_AND_SAVE` the default in 2.2.0 because Spark 2.1.0 broke some Hive 
tables backed by case-sensitive data files.

  > This situation will occur for any Hive table that wasn't created by 
Spark or that was created prior to Spark 2.1.0. If a user attempts to run a 
query over such a table containing a case-sensitive field name in the query 
projection or in the query filter, the query will return 0 results in every 
case.

- However, [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306) 
reports that this also corrupts the Hive Metastore schema by removing 
bucketing information (BUCKETING_COLS, SORT_COLS) and changing the table 
owner. These are undesirable side effects: the Hive Metastore is a shared 
resource, and Spark should not corrupt it by default. 

- Since Spark 2.3.0 supports bucketing, BUCKETING_COLS and SORT_COLS at least 
look okay there. However, we still need to figure out the owner-change issue, 
and the bucketing patch cannot be backported to `branch-2.2`. We need to 
verify this option with more tests before releasing 2.3.0.

This PR proposes to restore the default to `NEVER_INFER`, the behavior prior 
to Spark 2.2.0. Users who accept the risk can enable `INFER_AND_SAVE` 
themselves.
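
For illustration, a rough reproduction of the 0-results failure mode quoted 
above (all table names and paths below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .enableHiveSupport()
  .getOrCreate()

// Parquet keeps the case-sensitive column name "userId" on disk.
spark.range(10).selectExpr("id AS userId")
  .write.parquet("/tmp/case_demo")

// Hive stores column names lowercased, so the metastore schema
// records "userid" instead of "userId".
spark.sql(
  """CREATE EXTERNAL TABLE case_demo (userid BIGINT)
    |STORED AS PARQUET LOCATION '/tmp/case_demo'""".stripMargin)

// With NEVER_INFER, the lowercased metastore schema is used to read
// the Parquet files; "userid" does not match "userId", so queries
// projecting or filtering on that column can return 0 rows.
spark.sql("SELECT * FROM case_demo WHERE userid > 0").show()
```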

## How was this patch tested?

Pass the existing tests.
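
For reference, the inference modes are exercised by 
`HiveSchemaInferenceSuite` (added alongside SPARK-19611); assuming the 
standard sbt build, that suite can be run on its own:

$ build/sbt "hive/testOnly *HiveSchemaInferenceSuite"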

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19552.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19552


commit a256627dbc2772e69cd0f9f2aa43b384165e3657
Author: Dongjoon Hyun 
Date:   2017-10-22T17:59:15Z

[SPARK-22329][SQL] Use NEVER_INFER for 
`spark.sql.hive.caseSensitiveInferenceMode` by default




---
