Henry Robinson created SPARK-23852:
--------------------------------------

             Summary: Parquet MR bug can lead to incorrect SQL results
                 Key: SPARK-23852
                 URL: https://issues.apache.org/jira/browse/SPARK-23852
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Henry Robinson

Parquet 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, which means that pushing certain predicates down to Parquet scanners can return fewer results than it should.

The bug triggers in Spark when:

* The Parquet file being scanned has stats for the null count, but not the max or min, on the column with the predicate (Apache Impala writes files like this).
* The vectorized Parquet reader path is not taken, and the parquet-mr reader is used.
* A suitable <, <=, > or >= predicate is pushed down to Parquet.

The bug is that parquet-mr interprets the max and min of a row-group's column as 0 in the absence of stats. So {{col > 0}} will filter out all results, even if some are > 0.

There is no upstream release of Parquet that contains the fix for PARQUET-1217, although a 1.10 release is planned.

The least impactful workaround is to set the Parquet configuration {{parquet.filter.stats.enabled}} to {{false}}.
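
To make the failure mode concrete, here is a small toy model in Python of row-group pruning for a {{col > threshold}} predicate. It is only a sketch of the behaviour described above, not parquet-mr's actual code: the names ColumnStats, RowGroup, can_drop_gt_buggy, can_drop_gt_conservative and scan_gt are all hypothetical, and the "conservative" check is just one plausible way to handle missing stats, not necessarily what the PARQUET-1217 fix does.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ColumnStats:
    """Toy stand-in for row-group column statistics (hypothetical type)."""
    null_count: int
    min_value: Optional[int] = None  # None models "min was not written"
    max_value: Optional[int] = None  # None models "max was not written"


@dataclass
class RowGroup:
    values: List[int]
    stats: ColumnStats


def can_drop_gt_buggy(stats: ColumnStats, threshold: int) -> bool:
    # Models the described bug: a missing max is read as 0.
    max_value = 0 if stats.max_value is None else stats.max_value
    return max_value <= threshold


def can_drop_gt_conservative(stats: ColumnStats, threshold: int) -> bool:
    # Assumption for illustration: without a max, never prune the row group.
    if stats.max_value is None:
        return False
    return stats.max_value <= threshold


def scan_gt(row_groups: List[RowGroup], threshold: int,
            can_drop: Callable[[ColumnStats, int], bool]) -> List[int]:
    """Return values > threshold, skipping row groups the stats check drops."""
    result = []
    for rg in row_groups:
        if can_drop(rg.stats, threshold):
            continue  # the whole row group is skipped without reading values
        result.extend(v for v in rg.values if v > threshold)
    return result


# A row group with a null count but no min/max, as in the Impala-written files.
row_groups = [RowGroup(values=[-3, 5, 12], stats=ColumnStats(null_count=0))]

buggy = scan_gt(row_groups, 0, can_drop_gt_buggy)
conservative = scan_gt(row_groups, 0, can_drop_gt_conservative)
# Disabling stats-based filtering corresponds to never dropping a row group.
stats_disabled = scan_gt(row_groups, 0, lambda stats, threshold: False)
```

In this model the buggy check drops the only row group, so the rows with values 5 and 12 are lost, while the conservative check and the stats-disabled scan both return them. The last case mirrors the effect of the {{parquet.filter.stats.enabled}} workaround: every row group is read and the predicate is applied to the actual values.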