[ https://issues.apache.org/jira/browse/SPARK-11153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-11153:
-------------------------------
    Description: 
Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may have been 
written with corrupted statistics (min/max values). These statistics are 
consulted by the filter push-down optimization. Since Spark 1.5 turns on 
Parquet filter push-down by default, we may end up with wrong query results. 
PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is still on 
1.7.0.
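
As a session-level workaround until the upgrade lands, users can disable 
Parquet filter push-down entirely. A minimal sketch, assuming the standard 
{{spark.sql.parquet.filterPushdown}} configuration key that controls this 
optimization in Spark 1.5:

{code:scala}
// Workaround sketch for Spark 1.5.x: turn off Parquet filter push-down so
// the (possibly corrupted) BINARY column statistics are never consulted.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

// Equivalently, when submitting the application:
//   spark-submit --conf spark.sql.parquet.filterPushdown=false ...
{code}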

Note that corrupted Parquet files of this kind can be produced by any Parquet 
data model, not only by Spark SQL.

This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, 
namely:

- {{StringType}}
- {{BinaryType}}
- {{DecimalType}} (although Spark SQL doesn't support pushing down 
{{DecimalType}} columns for now)

To avoid wrong query results, we should disable filter push-down for 
{{StringType}} and {{BinaryType}} columns until we upgrade to parquet-mr 
1.8.1, as sketched below.
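
A minimal sketch of the proposed change, assuming the push-down decision can 
be guarded by the Catalyst data type (the actual conversion logic lives in 
Spark SQL's {{ParquetFilters}}; {{canPushDown}} here is a hypothetical helper 
for illustration only):

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical guard: only allow push-down for types whose Parquet
// statistics are not affected by PARQUET-251.
def canPushDown(dataType: DataType): Boolean = dataType match {
  // Backed by Parquet BINARY: statistics may be corrupted, so skip
  // push-down until parquet-mr 1.8.1+ is in use.
  case StringType | BinaryType => false
  // Also BINARY-backed, but push-down is unsupported for decimals anyway.
  case _: DecimalType => false
  case _ => true
}
{code}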

  was:
Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may be written 
with corrupted statistics information. This information is used by filter 
push-down optimization. Since Spark 1.5 turns on Parquet filter push-down by 
default, we may end up with wrong query results. PARQUET-251 has been fixed in 
parquet-mr 1.8.1, but Spark 1.5 is still using 1.7.0.

This affects all Spark SQL data types that can be mapped to Parquet {{BINARY}}, 
namely:

- {{StringType}}
- {{BinaryType}}
- {{DecimalType}} (but Spark SQL doesn't support pushing down {{DecimalType}} 
columns for now.)

To avoid wrong query results, we should disable filter push-down for columns of 
{{StringType}} and {{BinaryType}} until we upgrade to parquet-mr 1.8.


> Turns off Parquet filter push-down for string and binary columns
> ----------------------------------------------------------------
>
>                 Key: SPARK-11153
>                 URL: https://issues.apache.org/jira/browse/SPARK-11153
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Cheng Lian
>            Assignee: Cheng Lian
>            Priority: Blocker
>
> Due to PARQUET-251, {{BINARY}} columns in existing Parquet files may have 
> been written with corrupted statistics (min/max values). These statistics 
> are consulted by the filter push-down optimization. Since Spark 1.5 turns 
> on Parquet filter push-down by default, we may end up with wrong query 
> results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 is 
> still on 1.7.0.
>
> Note that corrupted Parquet files of this kind can be produced by any 
> Parquet data model, not only by Spark SQL.
>
> This affects all Spark SQL data types that can be mapped to Parquet 
> {{BINARY}}, namely:
>
> - {{StringType}}
> - {{BinaryType}}
> - {{DecimalType}} (although Spark SQL doesn't support pushing down 
> {{DecimalType}} columns for now)
>
> To avoid wrong query results, we should disable filter push-down for 
> {{StringType}} and {{BinaryType}} columns until we upgrade to parquet-mr 
> 1.8.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
