[ https://issues.apache.org/jira/browse/SPARK-19455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868457#comment-15868457 ]
Adam Budde commented on SPARK-19455:
------------------------------------

Closing this in favor of https://issues.apache.org/jira/browse/SPARK-19611

> Add option for case-insensitive Parquet field resolution
> --------------------------------------------------------
>
>                 Key: SPARK-19455
>                 URL: https://issues.apache.org/jira/browse/SPARK-19455
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Adam Budde
>
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the
> schema inference from the HiveMetastoreCatalog class when converting a
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in
> favor of simply using the schema returned by the metastore. This is an
> optimization, as the underlying file statuses no longer need to be resolved
> until after the partition pruning step, reducing the number of files to be
> touched significantly in some cases. The downside is that the data schema
> used may no longer match the underlying file schema for case-sensitive
> formats such as Parquet.
> This change initially included a [patch to
> ParquetReadSupport|https://github.com/apache/spark/blob/6ce1b675ee9fc9a6034439c3ca00441f9f172f84/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala#L270-L284]
> that attempted to remedy this conflict by using a case-insensitive fallback
> mapping when resolving field names during the schema clipping step.
> [SPARK-18333|https://issues.apache.org/jira/browse/SPARK-18333] later removed
> this patch after
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support
> for embedding a case-sensitive schema as a Hive Metastore table property.
> AFAIK the assumption here was that the data schema obtained from the
> Metastore table property will be case sensitive and should match the Parquet
> schema exactly.
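[Editor's note: a minimal sketch of the case-insensitive fallback mapping described above, in plain Scala. The object and method names here are hypothetical illustrations, not the actual ParquetReadSupport code: exact matches win, a unique case-insensitive match is used as a fallback, and an ambiguous match fails.]

```scala
// Hypothetical sketch of case-insensitive field resolution during schema
// clipping (assumed names; not Spark's actual implementation).
object CaseInsensitiveFieldResolution {
  // Resolve a catalyst field name against the field names actually present
  // in a Parquet file. Returns the matching Parquet field name, if any.
  def resolve(catalystName: String, parquetFieldNames: Seq[String]): Option[String] = {
    // An exact (case-sensitive) match always wins.
    parquetFieldNames.find(_ == catalystName).orElse {
      // Otherwise fall back to a case-insensitive match.
      parquetFieldNames.filter(_.equalsIgnoreCase(catalystName)) match {
        case Seq(unique) => Some(unique)
        case Seq()       => None
        case ambiguous   =>
          throw new RuntimeException(
            s"Ambiguous case-insensitive match for '$catalystName': " +
              ambiguous.mkString(", "))
      }
    }
  }
}
```

For example, a metastore schema field "myfield" would resolve to a file field "myField", while two file fields differing only in case would be rejected as ambiguous.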
> The problem arises when dealing with Parquet-backed tables for which this
> schema has not been embedded as a table property and for which the
> underlying files contain case-sensitive field names. This will happen for any
> Hive table that was not created by Spark or was created by a version prior to
> 2.1.0. We've seen Spark SQL return no results for any query containing a
> case-sensitive field name against such tables.
> The change we're proposing is to introduce a configuration parameter that
> will re-enable case-insensitive field name resolution in ParquetReadSupport.
> This option will also disable filter push-down for Parquet, as the filter
> predicate constructed by Spark SQL contains the case-insensitive field names,
> for which Parquet will return 0 records when filtering against a
> case-sensitive column name. I was hoping to find a way to construct the
> filter on-the-fly in ParquetReadSupport, but Parquet doesn't propagate the
> Configuration object passed to this class to the underlying
> InternalParquetRecordReader class.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
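[Editor's note: a toy illustration, not Parquet's actual filter API, of why the pushed-down predicate matches nothing. When records are keyed by the file's case-sensitive field names, a predicate built from the metastore's lower-cased schema name never finds the field, so every row is filtered out.]

```scala
// Toy model of an equality filter pushed down against case-sensitive
// record field names (assumed types; not Parquet's FilterApi).
object FilterCaseDemo {
  type Record = Map[String, Any] // field name -> value; keys are case-sensitive

  // A predicate comparing a named field for equality, like a pushed-down filter.
  def eqFilter(field: String, value: Any): Record => Boolean =
    record => record.get(field).contains(value)

  // Count how many rows survive the filter.
  def countMatches(rows: Seq[Record], field: String, value: Any): Int =
    rows.count(eqFilter(field, value))
}
```

With file rows like `Map("myField" -> 1)`, a predicate on the lower-cased name `"myfield"` matches zero rows, while the correctly-cased `"myField"` matches as expected, which is why the proposed option has to disable push-down.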