rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable dictionary filtering by default at Parquet URL: https://github.com/apache/spark/pull/23622#discussion_r251248759
########## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ########## @@ -314,6 +314,17 @@ class ParquetFileFormat SQLConf.CASE_SENSITIVE.key, sparkSession.sessionState.conf.caseSensitiveAnalysis) + // There are two things to note here. + // + // 1. Dictionary filtering has an issue about the predication on null. For this reason, + // This filtering is disabled. See SPARK-26677. + // + // 2. We should disable 'parquet.filter.dictionary.enabled' but + // the 'parquet.filter.stats.enabled' and 'parquet.filter.dictionary.enabled' were + // swapped mistakenly in Parquet side. It should use 'parquet.filter.dictionary.enabled' + // when Spark upgrades Parquet. See PARQUET-1309. + hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false") Review comment: Dictionary filtering has been available for more than 2 years and Netflix has been using it by default for almost 3 years without problems. The feature is stable. Keep in mind what a narrow use case hits this bug. First, an entire row group in a file needs to have a just one value and nulls: either the column has just one non-null value for all rows, or a sort or write pattern has clustered null rows with one other value (enough for an entire row group). Next, a query must use null-safe equality *and* negate it. In my experience, most people don't know what null-safe equality is. Going back to the use cases that produce data like this, the first use case -- only one non-null value -- would probably result in filter like `col NOT NULL` instead. The second write pattern is the problematic one, but how likely is the intersection of data with that write pattern and searching for all values except one with null-safe equality? I don't think it is likely and I think that's why we haven't found this until now. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org