[ 
https://issues.apache.org/jira/browse/PARQUET-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687115#comment-17687115
 ] 

ASF GitHub Bot commented on PARQUET-2237:
-----------------------------------------

yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1425942666

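   @wgtmac Sorry, the `Boolean` type has to be used here so that we can 
distinguish `BLOCK_MIGHT_MATCH` from `BLOCK_MUST_MATCH`. Two `Boolean` objects 
can hold the same value and still be told apart by reference, which is not 
possible with primitive `boolean`. Here is an example: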
   ```
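   // Two separately allocated wrappers holding the same value
   Boolean b1 = new Boolean(true);
   Boolean b2 = new Boolean(true);
   boolean b3 = true;
   boolean b4 = true;
   
   assert b1 != b2;        // different objects, so reference comparison tells them apart
   assert b1.equals(b2);   // yet their values are equal
   assert b2 == b3 == b4;  // b2 is unboxed here, so only values are compared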
   ```
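
The same idea applied to the filter constants can be sketched as follows. This is only an illustration under stated assumptions: the class name, the `describe` helper, and the values given to the constants are hypothetical and are not taken from the PR. It assumes `BLOCK_MIGHT_MATCH` and `BLOCK_MUST_MATCH` are two separately allocated `Boolean` objects that carry the same value and are recognised by reference.

```java
class BlockMatchSentinels {
    // Assumption: both sentinels mean "do not drop the block" (value false),
    // but they are distinct objects so callers can tell them apart with ==.
    @SuppressWarnings({"deprecation", "removal"})
    static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
    @SuppressWarnings({"deprecation", "removal"})
    static final Boolean BLOCK_MUST_MATCH = new Boolean(false);
    // Assumption: value true means "the block can be dropped".
    static final Boolean BLOCK_CANNOT_MATCH = Boolean.TRUE;

    // Hypothetical helper: interprets a filter result.
    static String describe(Boolean canDrop) {
        // Reference comparison first: this is what a primitive boolean cannot express.
        if (canDrop == BLOCK_MUST_MATCH) {
            return "must match";
        }
        if (canDrop == BLOCK_MIGHT_MATCH) {
            return "might match";
        }
        // Any other Boolean falls back to its value.
        return canDrop ? "cannot match" : "might match";
    }

    public static void main(String[] args) {
        System.out.println(describe(BLOCK_MUST_MATCH));
        System.out.println(describe(BLOCK_MIGHT_MATCH));
        System.out.println(describe(BLOCK_CANNOT_MATCH));
    }
}
```

Code that only looks at the value still works unchanged, because both sentinels unbox to `false`; only code that cares about the exact result needs the reference comparison.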




> Improve performance when filters in RowGroupFilter can match exactly
> --------------------------------------------------------------------
>
>                 Key: PARQUET-2237
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2237
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Priority: Major
>
> If we can make an exact decision from the minMax statistics, we no longer 
> need to load the dictionary from the filesystem and compare values one by one.
> Similarly, the BloomFilter has to be loaded from the filesystem, which may 
> cost time and memory. If we can determine exactly whether the value exists 
> from the minMax or dictionary filters, then we can skip the BloomFilter to 
> improve performance.
> For example,
>  # When reading data greater than {{x1}} in the block, if the minMax 
> statistics show that all values are greater than {{x1}}, then we don't need 
> to read the dictionary and compare one by one.
>  # If we already have page dictionaries and have compared one by one, we 
> don't need to read the BloomFilter and compare.
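
The short-circuiting described in the issue can be sketched roughly as below. This is a minimal illustration, not the parquet-mr implementation: `FilterCascadeSketch`, `Verdict`, `minMaxGreaterThan`, and `evaluate` are hypothetical names, the statistics are plain `long` values, and nulls are assumed to be absent. Filters are passed as suppliers so that a later, more expensive filter (dictionary, BloomFilter) is only loaded when the earlier ones could not decide exactly.

```java
import java.util.List;
import java.util.function.Supplier;

class FilterCascadeSketch {
    enum Verdict { CANNOT_MATCH, MIGHT_MATCH, MUST_MATCH }

    // Hypothetical minMax check for the predicate "value > x1".
    static Verdict minMaxGreaterThan(long min, long max, long x1) {
        if (min > x1) {
            return Verdict.MUST_MATCH;    // every value in the block is greater than x1
        }
        if (max <= x1) {
            return Verdict.CANNOT_MATCH;  // no value in the block is greater than x1
        }
        return Verdict.MIGHT_MATCH;       // x1 falls inside [min, max): undecided
    }

    // Runs the filters in order and stops at the first exact answer,
    // so later suppliers are never invoked once the result is known.
    static Verdict evaluate(List<Supplier<Verdict>> filters) {
        for (Supplier<Verdict> filter : filters) {
            Verdict verdict = filter.get();
            if (verdict != Verdict.MIGHT_MATCH) {
                return verdict;
            }
        }
        return Verdict.MIGHT_MATCH;
    }
}
```

With the filters ordered as minMax, dictionary, BloomFilter, a block whose minimum is already above `x1` returns `MUST_MATCH` from the first supplier, and the dictionary and BloomFilter suppliers are never called.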



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
