[
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684185#comment-17684185
]
ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------
wgtmac commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1096546533
##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java:
##########
@@ -98,16 +99,19 @@ public List<BlockMetaData>
visit(FilterCompat.FilterPredicateCompat filterPredic
for (BlockMetaData block : blocks) {
boolean drop = false;
+ // Whether one filter can exactly determine the existence/nonexistence
of the value.
+ // If true then we can skip the remaining filters to save time and space.
+ AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);
Review Comment:
Why atomic?
##########
parquet-hadoop/src/test/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilterTest.java:
##########
@@ -792,6 +793,16 @@ public void testInverseUdpMissingColumn() throws Exception
{
canDrop(LogicalInverseRewriter.rewrite(not(userDefined(fake,
nullRejecting))), ccmd, dictionaries));
}
+ @Test
+ public void testCanSkipOtherFilters() {
Review Comment:
The test looks a little bit insufficient. More kinds of predicates and
compound predicates need to be covered. Also test of `RowGroupFilter` is
missing.
##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java:
##########
@@ -559,4 +591,12 @@ private static boolean
hasNonDictionaryPages(ColumnChunkMetaData meta) {
return true;
}
}
+
+ private <T extends Comparable<T>> void markCanExactlyDetermine(Set<T>
dictSet) {
+ if (dictSet == null) {
+ canExactlyDetermine = false;
Review Comment:
It seems that `canExactlyDetermine` should use `OR` to update its value.
Otherwise, any predicate with a null dict will set it to false even if previous
predicates have marked it to true.
Additionally, we may have a chance to shortcut the evaluation as well if any
predicate has set it to true.
##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java:
##########
@@ -98,16 +99,19 @@ public List<BlockMetaData>
visit(FilterCompat.FilterPredicateCompat filterPredic
for (BlockMetaData block : blocks) {
boolean drop = false;
+ // Whether one filter can exactly determine the existence/nonexistence
of the value.
+ // If true then we can skip the remaining filters to save time and space.
+ AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);
Review Comment:
I'd suggest rename `canExactlyDetermine` to `preciselyDetermined`. Or even
better, use an enum something like below
```java
enum PredicateEvaluation {
CAN_DROP, /* the block can be dropped for sure */
CANNOT_DROP, /* the block cannot be dropped for sure*/
MAY_DROP, /* cannot decide yet, may be dropped by other filter levels
*/
}
```
In this way, we can merge the the two boolean values here. The downside is
that the code may need more refactoring to add the enum value to different
filter classes.
> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
> Key: PARQUET-2237
> URL: https://issues.apache.org/jira/browse/PARQUET-2237
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Priority: Major
>
> Bloomfilter needs to load from filesystem, it may costs time and space. If we
> can exactly determine the existence/nonexistence of the value from other
> filters , then we can avoid using Bloomfilter to Improve performance.
>
> When the minMax values in StatisticsFilter is same, we can exactly determine
> the existence/nonexistence of the value.
> When we have page dictionaries, we can also determine the
> existence/nonexistence of the value.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)