zhongyujiang opened a new pull request, #10090:
URL: https://github.com/apache/iceberg/pull/10090

   This PR refactors the three Parquet row-group filters so that each one 
computes and returns a residual expression for the given row-group. The 
residual computed by one filter can be passed to the next, allowing the three 
filters to work together. This improves row-group skipping for some `OR` 
condition queries.
   
   For example, assume a query `a = 'foo' OR b = 'bar'`, where column `a` is 
dictionary-encoded in a Parquet row-group, while column `b` is not 
dictionary-encoded in all data pages but has a bloom filter. Therefore, 
`a = 'foo'` can only be evaluated by the dictionary filter, and `b = 'bar'` 
can only be evaluated by the bloom filter. Currently, even if each filter 
individually evaluates its sub-expression as `ROWS_CANNOT_MATCH`, the final 
output is still `ROWS_MIGHT_MATCH`, because each filter can evaluate only one 
side of the `OR` (assuming the metrics filter evaluates both sub-expressions 
as `ROWS_MIGHT_MATCH`).
   After refactoring to compute residuals, the dictionary filter reduces 
`a = 'foo' OR b = 'bar'` to the residual `b = 'bar'`. This residual is then 
passed to the bloom filter, which evaluates it to 
`Expressions.alwaysFalse()`. As a result, reading this row-group can be 
skipped.
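
   The chaining described above can be sketched with a small, self-contained 
toy model. Everything here is hypothetical and for illustration only: the 
`Expr`, `Eq`, `Or`, and `Const` types and the `residual` method are not 
Iceberg's `Expression` API or the filters in this PR, and each filter's 
knowledge is simplified to a map from column name to the set of values that 
may occur in the row-group (a real bloom filter can only rule values out, 
which is the only property the sketch relies on).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ResidualChainSketch {
  // Hypothetical expression model; NOT Iceberg's Expression API.
  interface Expr {}

  enum Const implements Expr { ALWAYS_TRUE, ALWAYS_FALSE }

  static final class Eq implements Expr {
    final String column;
    final String value;
    Eq(String column, String value) { this.column = column; this.value = value; }
    @Override public String toString() { return column + " = '" + value + "'"; }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override public String toString() { return "(" + left + " OR " + right + ")"; }
  }

  // evidence: column -> values that may occur in the row-group, only for the
  // columns this (toy) filter is able to evaluate.
  static Expr residual(Expr expr, Map<String, Set<String>> evidence) {
    if (expr instanceof Eq) {
      Eq eq = (Eq) expr;
      Set<String> values = evidence.get(eq.column);
      if (values == null) {
        return eq; // this filter cannot evaluate the column: keep the predicate
      }
      // Value ruled out -> cannot match; otherwise rows might match, so keep it.
      return values.contains(eq.value) ? eq : Const.ALWAYS_FALSE;
    }
    if (expr instanceof Or) {
      Or or = (Or) expr;
      Expr left = residual(or.left, evidence);
      Expr right = residual(or.right, evidence);
      if (left == Const.ALWAYS_TRUE || right == Const.ALWAYS_TRUE) {
        return Const.ALWAYS_TRUE;
      }
      if (left == Const.ALWAYS_FALSE) {
        return right; // false OR x simplifies to x
      }
      if (right == Const.ALWAYS_FALSE) {
        return left;
      }
      return new Or(left, right);
    }
    return expr; // constants pass through unchanged
  }

  public static void main(String[] args) {
    Expr query = new Or(new Eq("a", "foo"), new Eq("b", "bar"));

    // Toy dictionary filter: knows column a, and 'foo' is not in the dictionary.
    Map<String, Set<String>> dictionary = new HashMap<>();
    dictionary.put("a", new HashSet<>(Arrays.asList("baz", "qux")));

    // Toy bloom filter: knows column b, and 'bar' is ruled out.
    Map<String, Set<String>> bloom = new HashMap<>();
    bloom.put("b", new HashSet<>(Arrays.asList("other")));

    Expr afterDictionary = residual(query, dictionary);
    Expr afterBloom = residual(afterDictionary, bloom);

    System.out.println(afterDictionary); // b = 'bar'
    System.out.println(afterBloom);      // ALWAYS_FALSE
  }
}
```

   In this sketch, neither toy filter alone can reduce the full `OR` to 
`ALWAYS_FALSE`; only passing the residual from one to the other does.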
   
   This revives #6893 and can close #10029.
   
   cc @cccs-jc  @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you 
please review this? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

