yabola commented on code in PR #1023: URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1114365987
########## parquet-hadoop/src/main/java/org/apache/parquet/filter2/statisticslevel/StatisticsFilter.java: ########## @@ -289,8 +320,14 @@ public <T extends Comparable<T>> Boolean visit(Lt<T> lt) { T value = lt.getValue(); - // drop if value <= min - return stats.compareMinToValue(value) >= 0; + // we are looking for records where v < someValue + if (stats.compareMinToValue(value) >= 0) { + // drop if value <= min + return BLOCK_CANNOT_MATCH; + } else { + // if value > min, we must take it + return BLOCK_MUST_MATCH; Review Comment: @gszadovszky @wgtmac @zhongyujiang Thank you very much for working on it. I have some thoughts. > We can improve (?) the case when not all the pages are dictionary encoded 1. I can't make sure if it is suitable to load dictionary even if pages are not all decoded. (I may choose not to change this behavior) 2. However considering the origin `BloomFilter` bug in parquet v1, we might have to do something to avoid using `BloomFilter`(even if pages are all encoded). In the code implementation we may have to use some flag to mark if dictionary `DictionaryFilter#expandDictionary` successfully (method will throw `IOException` and we can't `expandDictionary` again in `BloomFilterImpl`). Or we could also use `BLOCK_MUST_MATCH` like this PR. > StatisticsFilter: Because of the lower/upper bound issue we cannot really improve this (except for the specific case when min=max) If we only use it when min=max, I think it might not really improve . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org