rdblue commented on a change in pull request #23622: [SPARK-26677][SQL] Disable 
dictionary filtering by default at Parquet
URL: https://github.com/apache/spark/pull/23622#discussion_r251248759
 
 

 ##########
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ##########
 @@ -314,6 +314,17 @@ class ParquetFileFormat
       SQLConf.CASE_SENSITIVE.key,
       sparkSession.sessionState.conf.caseSensitiveAnalysis)
 
+    // There are two things to note here.
+    //
+    // 1. Dictionary filtering has an issue about the predication on null. For 
this reason,
+    //   This filtering is disabled. See SPARK-26677.
+    //
+    // 2. We should disable 'parquet.filter.dictionary.enabled' but
+    //   the 'parquet.filter.stats.enabled' and 
'parquet.filter.dictionary.enabled' were
+    //   swapped mistakenly in Parquet side. It should use 
'parquet.filter.dictionary.enabled'
+    //   when Spark upgrades Parquet. See PARQUET-1309.
+    hadoopConf.setIfUnset(ParquetInputFormat.STATS_FILTERING_ENABLED, "false")
 
 Review comment:
   Dictionary filtering has been available for more than 2 years and Netflix 
has been using it by default for almost 3 years without problems. The feature 
is stable.
   
   Keep in mind what a narrow use case hits this bug. First, an entire row 
group in a file needs to have a just one value and nulls: either the column has 
just one non-null value for all rows, or a sort or write pattern has clustered 
null rows with one other value (enough for an entire row group). Next, a query 
must use null-safe equality *and* negate it. In my experience, most people 
don't know what null-safe equality is. Going back to the use cases that produce 
data like this, the first use case -- only one non-null value -- would probably 
result in filter like `col NOT NULL` instead. The second write pattern is the 
problematic one, but how likely is the intersection of data with that write 
pattern and searching for all values except one with null-safe equality? I 
don't think it is likely and I think that's why we haven't found this until now.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to