[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868457#comment-15868457 ]
Adam Budde commented on SPARK-19455:
------------------------------------

Closing this in favor of https://issues.apache.org/jira/browse/SPARK-19611

> Add option for case-insensitive Parquet field resolution
> --------------------------------------------------------
>
>                 Key: SPARK-19455
>                 URL: https://issues.apache.org/jira/browse/SPARK-19455
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
> schema inference from the HiveMetastoreCatalog class when converting a
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
> favor of simply using the schema returned by the metastore. This is an
> optimization, as the underlying file statuses no longer need to be resolved
> until after the partition pruning step, reducing the number of files to be
> touched significantly in some cases. The downside is that the data schema
> used may no longer match the underlying file schema for case-sensitive
> formats such as Parquet.
> This change initially included a [patch to
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback
> mapping when resolving field names during the schema clipping step.
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed
> this patch after
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support
> for embedding a case-sensitive schema as a Hive Metastore table property.
> AFAIK the assumption here was that the data schema obtained from the
> Metastore table property will be case sensitive and should match the Parquet
> schema exactly.
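[Editor's note: a minimal sketch of the case-insensitive fallback mapping described above, in plain Scala. The object and method names here are hypothetical illustrations, not the actual ParquetReadSupport code: exact matches win, a unique case-insensitive match is used as a fallback, and an ambiguous match fails.]

```scala
// Hypothetical sketch of case-insensitive field resolution during schema
// clipping (assumed names; not Spark's actual implementation).
object CaseInsensitiveFieldResolution {
  // Resolve a catalyst field name against the field names actually present
  // in a Parquet file. Returns the matching Parquet field name, if any.
  def resolve(catalystName: String, parquetFieldNames: Seq[String]): Option[String] = {
    // An exact (case-sensitive) match always wins.
    parquetFieldNames.find(_ == catalystName).orElse {
      // Otherwise fall back to a case-insensitive match.
      parquetFieldNames.filter(_.equalsIgnoreCase(catalystName)) match {
        case Seq(unique) => Some(unique)
        case Seq()       => None
        case ambiguous   =>
          throw new RuntimeException(
            s"Ambiguous case-insensitive match for '$catalystName': " +
              ambiguous.mkString(", "))
      }
    }
  }
}
```

For example, a metastore schema field "myfield" would resolve to a file field "myField", while two file fields differing only in case would be rejected as ambiguous.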
> The problem arises when dealing with Parquet-backed tables for which this
> schema has not been embedded as a table property and for which the
> underlying files contain case-sensitive field names. This will happen for any
> Hive table that was not created by Spark or was created by a version prior to
> 2.1.0. We've seen Spark SQL return no results for any query containing a
> case-sensitive field name against such tables.
> The change we're proposing is to introduce a configuration parameter that
> will re-enable case-insensitive field name resolution in ParquetReadSupport.
> This option will also disable filter push-down for Parquet, as the filter
> predicate constructed by Spark SQL contains the case-insensitive field names,
> for which Parquet will return 0 records when filtering against a
> case-sensitive column name. I was hoping to find a way to construct the
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the
> Configuration object passed to this class to the underlying
> InternalParquetRecordReader class.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
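[Editor's note: a toy illustration, not Parquet's actual filter API, of why the pushed-down predicate matches nothing. When records are keyed by the file's case-sensitive field names, a predicate built from the metastore's lower-cased schema name never finds the field, so every row is filtered out.]

```scala
// Toy model of an equality filter pushed down against case-sensitive
// record field names (assumed types; not Parquet's FilterApi).
object FilterCaseDemo {
  type Record = Map[String, Any] // field name -> value; keys are case-sensitive

  // A predicate comparing a named field for equality, like a pushed-down filter.
  def eqFilter(field: String, value: Any): Record => Boolean =
    record => record.get(field).contains(value)

  // Count how many rows survive the filter.
  def countMatches(rows: Seq[Record], field: String, value: Any): Int =
    rows.count(eqFilter(field, value))
}
```

With file rows like `Map("myField" -> 1)`, a predicate on the lower-cased name `"myfield"` matches zero rows, while the correctly-cased `"myField"` matches as expected, which is why the proposed option has to disable push-down.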