rdblue opened a new pull request, #6935:
URL: https://github.com/apache/iceberg/pull/6935

   This adds an evaluator, `ParquetIndexPageFilter.EvalVisitor` to evaluate an 
Iceberg Expression against Parquet's page indexes. That produces Parquet's 
`RowRanges`, which track ranges of rows that should be read from the Parquet 
file. The filter class, `ParquetIndexPageFilter` also sets the row ranges on 
the `ParquetFileReader`.
   
   Parquet doesn't expose some of the `RowRanges` methods publicly, so this has 
a `ParquetRanges` class that uses reflection to construct the ranges and to use 
them to filter `ParquetFileReader`.
   
   Also, **this does not update record materialization**. Record 
materialization will need to coordinate the values read across pages. 
Currently, it assumes that the pages returned are in sync, but that will change 
when Parquet filters pages.
   
   For example, if column A has just one page, `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` 
and column B has two pages, `[a, b, c, d, e]` and `[f, g, h, i, j, k]`, and the 
row range is `6-8` then the results should be `(6, g)`, `(7, h)`, and `(8, i)`. 
The reader for column A will get the entire page and need to skip 6 values (0, 
1, 2, 3, 4, and 5), then the reader for column B will get just the second page 
and need to skip 1 value (f) to get in sync. The readers will need to handle 
this.
   
   I did this work a while ago and after looking at it with fresh eyes today, I 
have a couple concerns:
   1. I didn't get to writing tests for the `RowRanges` that are produced. This 
should be done next.
   2. The filter is applied at the row group level, but it looks like that's 
not how `RowRanges` works in Parquet, so this probably needs to be refactored 
to run the filter, produce `RowRanges`, and set them on the `ParquetFileReader` 
just once. In addition, if any row groups are completely eliminated by the 
filter, this will need to account for the missing row groups in the Iceberg 
Parquet reader.
   3. Since the filter is applied at the row group level, `allRows` may be 
incorrect (it is the row group's row count) and need to be adjusted to be the 
entire file rather than specific to a row group.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to