[GitHub] spark pull request #19552: [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/19552
Github user budde commented on a diff in the pull request: https://github.com/apache/spark/pull/19552#discussion_r146416338

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -388,7 +388,7 @@ object SQLConf {
       .stringConf
       .transform(_.toUpperCase(Locale.ROOT))
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-      .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+      .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
```

End diff --

```INFER_AND_SAVE``` was introduced to fix the issues described in [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611), which broke any table without the Spark-embedded table schema. That affected any table not created with Spark 2.0 or above, including tables created by older versions of Spark SQL (this was the situation we ran into). Some issues with how this mode affects other Hive table properties were uncovered in [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306). These problems are resolved by falling back to ```NEVER_INFER```, the default used prior to Spark 2.2.0. This means that, out of the box, Spark still won't be compatible with Hive tables backed by case-sensitive data files that weren't created by Spark SQL 2.0 or above, but it avoids mangling existing Hive table properties. This is meant as a short-term fix until I can go back and debug/resolve the conflicts that are occurring. I think these issues have highlighted how brittle an approach relying on Spark-specific Hive table properties is, especially since it's impossible to predict how other frameworks will use those table properties themselves, but I don't think there's any better way of doing this, and we may just have to deal with conflicts like this as they arise.
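To make the trade-off above concrete: after falling back to ```NEVER_INFER```, a user who still wants case-sensitive schema inference can opt back in per application. A minimal sketch, assuming a Spark 2.2+ build with Hive support (the application name is made up; the config key and its valid values are real):

```scala
import org.apache.spark.sql.SparkSession

// Opt back into case-sensitive schema inference explicitly.
// Valid values are NEVER_INFER, INFER_ONLY, and INFER_AND_SAVE.
// INFER_AND_SAVE persists the inferred schema into the Hive Metastore
// as a Spark-specific table property, which is exactly the mechanism
// that SPARK-22306 reports side effects from.
val spark = SparkSession.builder()
  .appName("case-sensitive-inference-example") // hypothetical name
  .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
  .enableHiveSupport()
  .getOrCreate()
```

With this set, queries over Hive tables backed by case-sensitive data files behave as in Spark 2.2.0, at the user's own risk.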
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19552#discussion_r146385329

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

```diff
@@ -388,7 +388,7 @@ object SQLConf {
       .stringConf
       .transform(_.toUpperCase(Locale.ROOT))
       .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
-      .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)
+      .createWithDefault(HiveCaseSensitiveInferenceMode.NEVER_INFER.toString)
```

End diff --

We can improve the documentation instead of changing the default. If my understanding is right, this occurs only when Spark SQL tries to read a table created by other systems.
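For readers following the diff context: the config definition under discussion normalizes and validates user input before applying the default. A simplified stand-alone sketch of that behavior (the names mirror `SQLConf` and `HiveCaseSensitiveInferenceMode`, but this is an illustration, not the actual Spark code):

```scala
import java.util.Locale

// Simplified model of HiveCaseSensitiveInferenceMode; the real
// enumeration lives in org.apache.spark.sql.internal.SQLConf.
object HiveCaseSensitiveInferenceMode extends Enumeration {
  val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}

// Mirrors the .transform(_.toUpperCase(Locale.ROOT)) and
// .checkValues(...) steps from the diff: case-insensitive input,
// rejected unless it names one of the three modes.
def parseMode(raw: String): HiveCaseSensitiveInferenceMode.Value = {
  val normalized = raw.toUpperCase(Locale.ROOT)
  require(
    HiveCaseSensitiveInferenceMode.values.map(_.toString).contains(normalized),
    s"Invalid value for spark.sql.hive.caseSensitiveInferenceMode: $raw")
  HiveCaseSensitiveInferenceMode.withName(normalized)
}
```

Because of the `transform` step, values such as `never_infer` in a config file are accepted; only the `createWithDefault` line, i.e. the default mode, is what this PR changes.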
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/19552

[SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default

## What changes were proposed in this pull request?

In Spark 2.2.0, the default value of `spark.sql.hive.caseSensitiveInferenceMode` has a critical issue.

- [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611) switched the default to `INFER_AND_SAVE` in 2.2.0 because Spark 2.1.0 broke some Hive tables backed by case-sensitive data files.

  > This situation will occur for any Hive table that wasn't created by Spark or that was created prior to Spark 2.1.0. If a user attempts to run a query over such a table containing a case-sensitive field name in the query projection or in the query filter, the query will return 0 results in every case.

- However, [SPARK-22306](https://issues.apache.org/jira/browse/SPARK-22306) reports that this mode also corrupts the Hive Metastore schema by removing bucketing information (`BUCKETING_COLS`, `SORT_COLS`) and changing the table owner. These are undesirable side effects: the Hive Metastore is a shared resource, and Spark should not corrupt it by default.

- Since Spark 2.3.0 supports bucketing, `BUCKETING_COLS` and `SORT_COLS` look okay at least. However, we still need to figure out the issue of changing owners, and we cannot backport the bucketing patch into `branch-2.2`. We need to verify this option with more tests before releasing 2.3.0.

This PR proposes to restore the default to `NEVER_INFER`, as it was prior to Spark 2.2.0. Users can take the risk of enabling `INFER_AND_SAVE` themselves.

## How was this patch tested?

Pass the existing tests.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-22329

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19552.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19552

commit a256627dbc2772e69cd0f9f2aa43b384165e3657
Author: Dongjoon Hyun
Date: 2017-10-22T17:59:15Z

    [SPARK-22329][SQL] Use NEVER_INFER for `spark.sql.hive.caseSensitiveInferenceMode` by default