[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870164#comment-15870164 ]
Reynold Xin commented on SPARK-19611:
-------------------------------------

Rather than this fix, can we just save the case-sensitive schema in the catalog?

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19611
>                 URL: https://issues.apache.org/jira/browse/SPARK-19611
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Adam Budde
>
> This issue replaces
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and
> [PR #16797|https://github.com/apache/spark/pull/16797].
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
> schema inference from the HiveMetastoreCatalog class when converting a
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
> favor of simply using the schema returned by the metastore. This results in
> an optimization, as the underlying file statuses no longer need to be
> resolved until after the partition pruning step, reducing the number of
> files to be touched significantly in some cases. The downside is that the
> data schema used may no longer match the underlying file schema for
> case-sensitive formats such as Parquet.
>
> Unfortunately, this silently breaks queries over tables where the underlying
> data fields are case-sensitive but a case-sensitive schema wasn't written to
> the table properties by Spark. This situation will occur for any Hive table
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a
> user attempts to run a query over such a table containing a case-sensitive
> field name in the query projection or in the query filter, the query will
> return 0 results in every case.
>
> The change we are proposing is to bring back the schema inference that was
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive
> schema can be read from the table properties. Attempt to save the inferred
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the
> Hive Metastore. Useful if the user knows that none of the underlying data is
> case-sensitive.
>
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797]
> for more background on this issue and the proposed solution.
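
For readers skimming the thread, the three proposed modes can be sketched roughly as follows. This is a minimal illustration in plain Python, not Spark's implementation: the `resolve_schema` function, the `case_sensitive_schema` property key, and the `infer_from_files` callback are all hypothetical names chosen for the example, and schemas are reduced to lists of field names.

```python
from enum import Enum
from typing import Callable, Dict, List


class InferenceMode(Enum):
    """The three modes named in the issue description."""
    INFER_AND_SAVE = "INFER_AND_SAVE"
    INFER_ONLY = "INFER_ONLY"
    NEVER_INFER = "NEVER_INFER"


# Hypothetical property key; the real storage layout is not given in the issue.
SCHEMA_KEY = "case_sensitive_schema"


def resolve_schema(
    mode: InferenceMode,
    metastore_schema: List[str],
    table_properties: Dict[str, List[str]],
    infer_from_files: Callable[[], List[str]],
) -> List[str]:
    """Pick the field names to use when reading the table's data files."""
    saved = table_properties.get(SCHEMA_KEY)
    if saved is not None:
        # A case-sensitive schema was already recorded, so no inference is needed.
        return list(saved)
    if mode is InferenceMode.NEVER_INFER:
        # Fall back to the case-insensitive schema from the metastore.
        return list(metastore_schema)
    # Both remaining modes infer the schema from the data files.
    inferred = list(infer_from_files())
    if mode is InferenceMode.INFER_AND_SAVE:
        # Record the result so later reads can skip inference.
        table_properties[SCHEMA_KEY] = list(inferred)
    return inferred


# Example: the metastore holds lowercased names, the files hold mixed case.
metastore_fields = ["userid", "eventtime"]
file_fields = ["userId", "eventTime"]
props: Dict[str, List[str]] = {}

schema = resolve_schema(
    InferenceMode.INFER_AND_SAVE, metastore_fields, props, lambda: file_fields
)
print(schema)
print(props)
```

With NEVER_INFER the function returns the lowercased metastore names, which is the situation the issue describes: a case-sensitive lookup of `userid` against files that contain `userId` finds no matching column.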