[GitHub] [parquet-mr] yabola commented on a diff in pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

via GitHub Wed, 22 Feb 2023 06:08:14 -0800


yabola commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1114365987



##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java:
##########
@@ -289,8 +320,14 @@ public <T extends Comparable<T>> Boolean visit(Lt<T> lt) {
 
     T value = lt.getValue();
 
-    // drop if value <= min
-    return stats.compareMinToValue(value) >= 0;
+    // we are looking for records where v < someValue
+    if (stats.compareMinToValue(value) >= 0) {
+      // drop if value <= min
+      return BLOCK_CANNOT_MATCH;
+    } else {
+      // if value > min, we must take it
+      return BLOCK_MUST_MATCH;

Review Comment:
   @gszadovszky @wgtmac @zhongyujiang Thank you very much for working on it.  I 
have some thoughts.
   
   > We can improve (?) the case when not all the pages are dictionary encoded
   
   1. I can't make sure if it is suitable to load dictionary even if pages are 
not all decoded. (I may choose not to change this behavior)
   
   2. However considering the origin `BloomFilter` bug in parquet v1, we might 
have to do something to avoid using `BloomFilter`(even if pages are all 
encoded).  
   In the code implementation we may have to use some flag to mark if 
dictionary `DictionaryFilter#expandDictionary` successfully (method will throw 
`IOException` and we can't `expandDictionary` again in `BloomFilterImpl`).
   Or we could also use `BLOCK_MUST_MATCH` like this PR.
   
   > StatisticsFilter: Because of the lower/upper bound issue we cannot really 
improve this (except for the specific case when min=max)
   
   If we only use it when min=max, I think it might not really improve . 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

Reply via email to