alamb opened a new issue, #8845: URL: https://github.com/apache/arrow-rs/issues/8845
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

After the great work from @hhhizzz in https://github.com/apache/arrow-rs/pull/8733, we (finally) have the ability to use a Bitmask filter representation when applying filters *during* Parquet decode. 🎉

However, as @hhhizzz explains in https://github.com/apache/arrow-rs/pull/8733#discussion_r2480294162, using the bitmask filter approach will actually fetch unnecessary data pages in some cases.

In ASCII art:

```text
┏━━━━┓  ┌────────┐           - '1' means row is selected
┃ 1  ┃  │ Row 0  │           - '0' means row is filtered
┃ 0  ┃  │ Row 1  │
┃ 0  ┃  │ Row 2  │ Page 0
┃ 1  ┃  │ Row 3  │
┃ 0  ┃  │ Row 4  │
┃    ┃  └────────┘
┃    ┃  ┌────────┐
┃ 0  ┃  │ Row 5  │
┃ 0  ┃  │ Row 6  │           Filter only selects Row 0,
┃ 0  ┃  │ Row 7  │           Row 3, and Row 21. All other
┃ 0  ┃  │ Row 8  │           Rows are filtered out
┃ 0  ┃  │ Row 9  │           Filter selects no
┃ 0  ┃  │ Row 10 │ Page 1    rows from
┃ 0  ┃  │ Row 11 │           Page 1, but the
┃ 0  ┃  │ Row 12 │           current Filter mask
┃ 0  ┃  │ Row 13 │           strategy requires
┃ 0  ┃  │ Row 14 │
┃ 0  ┃  │ Row 15 │
┃    ┃  └────────┘
┃    ┃  ┌────────┐
┃ 0  ┃  │ Row 16 │
┃ 0  ┃  │ Row 17 │
┃ 0  ┃  │ Row 18 │ Page 2
┃ 0  ┃  │ Row 19 │
┃ 0  ┃  │ Row 20 │
┃ 1  ┃  │ Row 21 │
┃ 0  ┃  │ Row 22 │
┗━━━━┛  └────────┘
Filter  Data Pages
BitMask
```

When evaluating using `RowSelection`, the pages are not loaded at all, as the existing `skip_records()` machinery handles the case and skips reading any data from the pages accordingly.

However, when evaluating with a mask, the mask may include a page which was entirely ruled out (has no matching rows).

> A simple example:
> the page size is 2, the mask is 100001, row selection should be read(1) skip(4) read(1)
> the ColumnChunkData would be page1(10), page2(skipped), page3(01)
> Using the rowselection to skip(4), the page2 won't be read at all.
> But using the bit mask, we need all 6 value be read, but the page2 is not in the memory, which is why I need to construct this synthetic page.

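To make the quoted example concrete, here is a minimal Rust sketch that turns a mask into read/skip runs and reports which fixed-size pages contain a selected row. The `Run` enum and the `mask_to_runs` and `pages_needed` helpers are hypothetical names for illustration only; they are not part of the arrow-rs API, and real Parquet pages do not have to hold the same number of rows.

```rust
/// One run of a row selection: read or skip `len` consecutive rows.
/// (Hypothetical type for illustration, not the arrow-rs `RowSelector`.)
#[derive(Debug, PartialEq, Clone, Copy)]
enum Run {
    Read(usize),
    Skip(usize),
}

/// Collapse a boolean mask into alternating read/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let selected = mask[i];
        let start = i;
        // Advance to the end of the current run of equal bits.
        while i < mask.len() && mask[i] == selected {
            i += 1;
        }
        let len = i - start;
        runs.push(if selected { Run::Read(len) } else { Run::Skip(len) });
    }
    runs
}

/// For pages of a fixed `page_size` (assumed > 0), report which pages
/// contain at least one selected row.
fn pages_needed(mask: &[bool], page_size: usize) -> Vec<bool> {
    mask.chunks(page_size)
        .map(|page| page.iter().any(|&selected| selected))
        .collect()
}
```

For the mask `100001` with a page size of 2, `mask_to_runs` yields `[Read(1), Skip(4), Read(1)]` and `pages_needed` yields `[true, false, true]`: the middle page holds no selected rows, yet a single mask covering all six rows still spans it.
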
The current code handles this case by falling back to a selector strategy when masks would straddle page boundaries: https://github.com/apache/arrow-rs/blob/911331aafa13f5e230440cf5d02feb245985c64e/parquet/src/arrow/push_decoder/reader_builder/mod.rs#L681-L730

**Describe the solution you'd like**

I would like the parquet decoder to be able to use the mask evaluation strategy *AND* skip pages.

**Describe alternatives you've considered**

One option (from @tustvold) is to make the mask iteration page aware, so that we don't evaluate the predicate when we can rule out the entire page. Perhaps we can teach the bitmask iteration to be smarter in this case.

**Additional context**

Thoughts from @tustvold https://github.com/apache/arrow-rs/pull/8733#pullrequestreview-3407492068:

> By definition the mask selection strategy requests rows that weren't part of the original selection, the problem is that this could result in requesting rows for pages that we know are irrelevant. In some cases this just results in wasted IO, however, when using prefetching IO systems (such as AsyncParquetReader) this results in errors.
>
> I think a better solution would be to ensure we only construct MaskChunk that don't cross page boundaries. Ideally this would be done on a per-leaf column basis, but tbh I suspect just doing it globally would probably work just fine.
>
> Edit: If one was feeling fancy, one could ignore page boundaries where both pages were present in the original selection, although in practice I suspect this not to make a huge difference.
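
As a rough illustration of the idea quoted above from @tustvold (only construct mask chunks that do not cross page boundaries), the sketch below splits a mask at page boundaries and marks pages with no selected rows as skippable. `PageChunk` and `page_aligned_chunks` are hypothetical names, not arrow-rs types, and the per-page row counts are assumed to be known up front (for example from an offset index). This is one possible shape, not the approach the decoder actually takes.

```rust
use std::ops::Range;

/// What to do with the rows of one page (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum PageChunk {
    /// No row in the page is selected: skip this many rows without decoding.
    Skip(usize),
    /// At least one row is selected: decode the page and apply `mask[range]`.
    Mask(Range<usize>),
}

/// Split `mask` at page boundaries, given the number of rows in each page.
/// Every returned chunk covers exactly one page, so no mask chunk ever
/// requests rows from a page that the selection has ruled out.
fn page_aligned_chunks(mask: &[bool], page_rows: &[usize]) -> Vec<PageChunk> {
    let mut chunks = Vec::with_capacity(page_rows.len());
    let mut start = 0;
    for &rows in page_rows {
        // Clamp so a page list longer than the mask cannot index out of bounds.
        let end = (start + rows).min(mask.len());
        if mask[start..end].iter().any(|&selected| selected) {
            chunks.push(PageChunk::Mask(start..end));
        } else {
            chunks.push(PageChunk::Skip(end - start));
        }
        start = end;
    }
    chunks
}
```

With the layout from the diagram (pages of 5, 11, and 7 rows, and only rows 0, 3, and 21 selected), this returns `[Mask(0..5), Skip(11), Mask(16..23)]`, so Page 1 is never requested. The sketch splits globally rather than per leaf column, which matches the simpler of the two variants mentioned in the quote.
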
