[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

ASF GitHub Bot (Jira) Sat, 04 Feb 2023 08:01:34 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17684185#comment-17684185
 ]


ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------

wgtmac commented on code in PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#discussion_r1096546533


##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java:
##########
@@ -98,16 +99,19 @@ public List<BlockMetaData> 
visit(FilterCompat.FilterPredicateCompat filterPredic
 
     for (BlockMetaData block : blocks) {
       boolean drop = false;
+      // Whether one filter can exactly determine the existence/nonexistence 
of the value.
+      // If true then we can skip the remaining filters to save time and space.
+      AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);

Review Comment:
   Why atomic?



##########
parquet-hadoop/src/test/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilterTest.java:
##########
@@ -792,6 +793,16 @@ public void testInverseUdpMissingColumn() throws Exception 
{
       canDrop(LogicalInverseRewriter.rewrite(not(userDefined(fake, 
nullRejecting))), ccmd, dictionaries));
   }
 
+  @Test
+  public void testCanSkipOtherFilters() {

Review Comment:
   The test looks a little bit insufficient. More kinds of predicates and 
compound predicates need to be covered. Also test of `RowGroupFilter` is 
missing.



##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java:
##########
@@ -559,4 +591,12 @@ private static boolean 
hasNonDictionaryPages(ColumnChunkMetaData meta) {
       return true;
     }
   }
+
+  private <T extends Comparable<T>> void markCanExactlyDetermine(Set<T> 
dictSet) {
+    if (dictSet == null) {
+      canExactlyDetermine = false;

Review Comment:
   It seems that `canExactlyDetermine` should use `OR` to update its value. 
Otherwise, any predicate with a null dict will set it to false even if previous 
predicates have marked it to true.
   
   Additionally, we may have a chance to shortcut the evaluation as well if any 
predicate has set it to true.



##########
parquet-hadoop/src/main/java/org/apache/parquet/filter2/compat/RowGroupFilter.java:
##########
@@ -98,16 +99,19 @@ public List<BlockMetaData> 
visit(FilterCompat.FilterPredicateCompat filterPredic
 
     for (BlockMetaData block : blocks) {
       boolean drop = false;
+      // Whether one filter can exactly determine the existence/nonexistence 
of the value.
+      // If true then we can skip the remaining filters to save time and space.
+      AtomicBoolean canExactlyDetermine = new AtomicBoolean(false);

Review Comment:
   I'd suggest rename `canExactlyDetermine` to `preciselyDetermined`. Or even 
better, use an enum something like below
   ```java
   enum PredicateEvaluation {
     CAN_DROP,     /* the block can be dropped for sure */
     CANNOT_DROP,  /* the block cannot be dropped for sure*/
     MAY_DROP,     /* cannot decide yet, may be dropped by other filter levels 
*/
   }
   ```
   In this way, we can merge the the two boolean values here. The downside is 
that the code may need more refactoring to add the enum value to different 
filter classes.





> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
>                 Key: PARQUET-2237
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2237
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Priority: Major
>
> Bloomfilter needs to load from filesystem, it may costs time and space. If we 
> can  exactly determine the existence/nonexistence of the value from other 
> filters , then we can avoid using Bloomfilter to Improve performance.
>  
> When the minMax values in  StatisticsFilter is same, we can exactly determine 
> the existence/nonexistence of the value.
> When we have page dictionaries, we can also determine the 
> existence/nonexistence of the value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (PARQUET-2237) Improve performance when filters in RowGroupFilter can match exactly

Reply via email to