zhongyujiang opened a new pull request, #10090: URL: https://github.com/apache/iceberg/pull/10090
This PR refactors the three Parquet row-group filters to compute residual expressions: each filter now returns a residual expression for the given row group. The residual computed by one filter can be passed to the next, which lets the three filters work together. This improves the handling of some `OR` queries. For example:
Let's assume we have a query `a = 'foo' OR b = 'bar'`, where column `a` is dictionary-encoded in a Parquet row group, while column `b` is not dictionary-encoded in all of its data pages but has a bloom filter. Therefore, `a = 'foo'` can only be evaluated by the dictionary filter, and `b = 'bar'` can only be evaluated by the bloom filter.

Currently, even if each filter evaluates its own sub-expression as `ROWS_CANNOT_MATCH`, each filter can only evaluate one of the two sub-expressions, so the final result is still `ROWS_MIGHT_MATCH` (assuming the metrics filter evaluates both sub-expressions as `ROWS_MIGHT_MATCH`).

After the refactoring, the dictionary filter computes the residual of `a = 'foo' OR b = 'bar'` as `b = 'bar'`. This residual is then passed to the bloom filter, which evaluates it to `Expressions.alwaysFalse()`. As a result, reading this row group can be skipped.

This is a revival of #6893, and can close #10029.

cc @cccs-jc @rdblue @huaxingao @amogh-jahagirdar @RussellSpitzer Could you please review this? Thanks!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
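
The `a = 'foo' OR b = 'bar'` scenario described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: `ResidualChainSketch`, `Expr`, and `residual` are hypothetical names invented for illustration and are not Iceberg classes or the code in this PR. Each "filter" is modeled as a map from the columns it can evaluate to the values that might be present in the row group, and a predicate that a filter cannot disprove is conservatively left in the residual.

```java
import java.util.Map;
import java.util.Set;

/** Toy model of residual chaining. None of these types are Iceberg classes. */
final class ResidualChainSketch {

  /** Minimal expression tree: true, false, `column = value`, and OR. */
  static final class Expr {
    static final Expr TRUE = new Expr("true", null, null, null, null);
    static final Expr FALSE = new Expr("false", null, null, null, null);

    final String op;      // "true", "false", "eq", or "or"
    final String column;  // set only for "eq"
    final String value;   // set only for "eq"
    final Expr left;      // set only for "or"
    final Expr right;     // set only for "or"

    private Expr(String op, String column, String value, Expr left, Expr right) {
      this.op = op;
      this.column = column;
      this.value = value;
      this.left = left;
      this.right = right;
    }

    static Expr eq(String column, String value) {
      return new Expr("eq", column, value, null, null);
    }

    /** Builds an OR and simplifies it when one side is already true or false. */
    static Expr or(Expr left, Expr right) {
      if (left == TRUE || right == TRUE) return TRUE;
      if (left == FALSE) return right;
      if (right == FALSE) return left;
      return new Expr("or", null, null, left, right);
    }

    @Override
    public String toString() {
      switch (op) {
        case "eq": return column + " = '" + value + "'";
        case "or": return "(" + left + " OR " + right + ")";
        default: return op;
      }
    }
  }

  /**
   * Hypothetical filter step: `possibleValues` maps each column this filter can
   * evaluate to the values that might be present. Predicates on other columns,
   * or predicates that cannot be disproved, stay in the residual unchanged.
   */
  static Expr residual(Expr expr, Map<String, Set<String>> possibleValues) {
    switch (expr.op) {
      case "eq":
        Set<String> values = possibleValues.get(expr.column);
        boolean disproved = values != null && !values.contains(expr.value);
        return disproved ? Expr.FALSE : expr;
      case "or":
        return Expr.or(
            residual(expr.left, possibleValues), residual(expr.right, possibleValues));
      default:
        return expr;
    }
  }

  public static void main(String[] args) {
    Expr query = Expr.or(Expr.eq("a", "foo"), Expr.eq("b", "bar"));
    // Assumed row-group contents: the dictionary for `a` lacks 'foo',
    // and the stand-in for the bloom filter on `b` rules out 'bar'.
    Map<String, Set<String>> dictionary = Map.of("a", Set.of("x", "y"));
    Map<String, Set<String>> bloom = Map.of("b", Set.of("baz"));

    Expr afterDictionary = residual(query, dictionary);
    Expr afterBloom = residual(afterDictionary, bloom);
    System.out.println(afterDictionary); // b = 'bar'
    System.out.println(afterBloom);      // false
  }
}
```

Evaluated independently, neither step could reduce the whole `OR` to false, because each one has to leave the other column's predicate in place. Chaining the residuals is what lets the second step finish the job.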