wgtmac commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1113972176


##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java:
##########
@@ -289,8 +320,14 @@ public <T extends Comparable<T>> Boolean visit(Lt<T> lt) {
 
     T value = lt.getValue();
 
-    // drop if value <= min
-    return stats.compareMinToValue(value) >= 0;
+    // we are looking for records where v < someValue
+    if (stats.compareMinToValue(value) >= 0) {
+      // drop if value <= min
+      return BLOCK_CANNOT_MATCH;
+    } else {
+      // if value > min, we must take it
+      return BLOCK_MUST_MATCH;

Review Comment:
   > @yabola, I think you misunderstand how dictionary/filtering works. The 
dictionary contains all of the values that the dictionary-encoded pages may 
contain. These pages do not actually contain the values, only the indices 
that reference the corresponding values in the dictionary.
   > 
   > So, if a searched element can be found in the dictionary, you may return 
`BLOCK_MUST_MATCH` even if only one page is dictionary encoded. For example, if 
the filter is `x > 1`, then any element in the dictionary that is `> 1` would 
fulfill the filter, so the result is `BLOCK_MUST_MATCH`. If the dictionary does 
not contain any of the searched elements (in the previous example, every 
element is `<= 1`), then you may return `BLOCK_CANNOT_MATCH` only if all the 
related pages are dictionary encoded. Otherwise you return `BLOCK_MIGHT_MATCH`, 
since the dictionary tells you nothing about the pages that are not dictionary 
encoded.
   
   Correct me if I'm wrong, but the `DictionaryFilter` is only applied when all 
data pages in a given column chunk are dictionary-encoded. So @yabola's 
statement seems to work to me.
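
   To make the two positions concrete, here is a minimal sketch of the 
three-way decision as described in the quoted comment. It is not the 
parquet-mr implementation: the class `DictionaryDecisionSketch`, the `Verdict` 
enum, and the `decide` method are hypothetical names used only for 
illustration, and only the three `BLOCK_*` outcome names come from this thread.

```java
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these types exist in parquet-mr.
final class DictionaryDecisionSketch {

    // Outcome names taken from the discussion; modelling them as an enum is an assumption.
    enum Verdict { BLOCK_MUST_MATCH, BLOCK_MIGHT_MATCH, BLOCK_CANNOT_MATCH }

    /**
     * Applies the rule from the quoted comment:
     * - some dictionary value satisfies the filter -> BLOCK_MUST_MATCH
     * - no value satisfies it, all pages dictionary encoded -> BLOCK_CANNOT_MATCH
     * - no value satisfies it, some pages not dictionary encoded -> BLOCK_MIGHT_MATCH
     */
    static <T> Verdict decide(Set<T> dictionary,
                              boolean allPagesDictionaryEncoded,
                              Predicate<T> filter) {
        for (T value : dictionary) {
            if (filter.test(value)) {
                return Verdict.BLOCK_MUST_MATCH;
            }
        }
        // The dictionary says nothing about pages that are not dictionary encoded.
        return allPagesDictionaryEncoded
            ? Verdict.BLOCK_CANNOT_MATCH
            : Verdict.BLOCK_MIGHT_MATCH;
    }

    public static void main(String[] args) {
        Predicate<Integer> greaterThanOne = x -> x > 1; // the `x > 1` filter from the example
        Set<Integer> withMatch = Set.of(0, 1, 2);
        Set<Integer> withoutMatch = Set.of(0, 1);

        System.out.println(decide(withMatch, false, greaterThanOne));
        System.out.println(decide(withoutMatch, true, greaterThanOne));
        System.out.println(decide(withoutMatch, false, greaterThanOne));
    }
}
```

   If the filter only runs when every data page is dictionary-encoded, the 
`allPagesDictionaryEncoded` flag is always `true` and the `BLOCK_MIGHT_MATCH` 
branch in this sketch is never reached, which is why the two descriptions can 
both hold.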



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

