alamb opened a new issue, #8845:
URL: https://github.com/apache/arrow-rs/issues/8845

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   After the great work from @hhhizzz  in 
https://github.com/apache/arrow-rs/pull/8733, we (finally) have the ability to 
use a Bitmask filter representation when applying filters *during* Parquet 
decode. 🎉  
   
   However, as @hhhizzz explains in 
https://github.com/apache/arrow-rs/pull/8733#discussion_r2480294162, using the 
bitmask filter approach will actually fetch unnecessary data pages in some 
cases. In ASCII art:
   
   ```text
                                ┏━━━━┓ ┌────────┐                               
   - '1' means row is selected  ┃ 1  ┃ │ Row 0  │                               
   - '0' means row is filtered  ┃ 0  ┃ │ Row 1  │                               
                                ┃ 0  ┃ │ Row 2  │  Page 0                       
                                ┃ 1  ┃ │ Row 3  │                               
                                ┃ 0  ┃ │ Row 4  │                               
                                ┃    ┃ └────────┘                               
                                ┃    ┃ ┌────────┐                               
                                ┃ 0  ┃ │ Row 5  │                               
                                ┃ 0  ┃ │ Row 6  │                               
   Filter only selects Row 0,   ┃ 0  ┃ │ Row 7  │                               
   Row 3, and Row 21. All other ┃ 0  ┃ │ Row 8  │                               
   Rows are filtered out        ┃ 0  ┃ │ Row 9  │            Filter selects no  
                                ┃ 0  ┃ │ Row 10 │  Page 1    rows from          
                                ┃ 0  ┃ │ Row 11 │            Page 1, but the    
                                ┃ 0  ┃ │ Row 12 │            current Filter mask
                                ┃ 0  ┃ │ Row 13 │            strategy requires  
                                ┃ 0  ┃ │ Row 14 │                               
                                ┃ 0  ┃ │ Row 15 │                               
                                ┃    ┃ └────────┘                               
                                ┃    ┃ ┌────────┐                               
                                ┃ 0  ┃ │ Row 16 │                               
                                ┃ 0  ┃ │ Row 17 │                               
                                ┃ 0  ┃ │ Row 18 │  Page 2                       
                                ┃ 0  ┃ │ Row 19 │                               
                                ┃ 0  ┃ │ Row 20 │                               
                                ┃ 1  ┃ │ Row 21 │                               
                                ┃ 0  ┃ │ Row 22 │                               
                                ┗━━━━┛ └────────┘                               
                                         Data Pages                             
                                                                                
                               Filter                                           
                               BitMask                                          
   ```
   
   When evaluating using `RowSelection` the pages are not loaded at all, as the 
existing `skip_records()` machinery handles the case and skips reading any data 
from the pages accordingly.
   
   However, when evaluating with a mask, the mask may cover a page that was 
entirely ruled out (one with no matching rows).
   
   > A simple example:
   > the page size is 2, the mask is 100001, row selection should be read(1) 
skip(4) read(1)
   > the ColumnChunkData would be page1(10), page2(skipped), page3(01)
   > Using the rowselection to skip(4), the page2 won't be read at all.
   > But using the bit mask, we need all 6 value be read, but the page2 is not 
in the memory, which is why I need to construct this synthetic page.
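
To make the quoted example concrete, here is a minimal, dependency-free sketch. It does not use the real `RowSelection` type; `Run`, `mask_to_runs`, and `pages_with_selected_rows` are hypothetical names introduced only for illustration, and pages are assumed to have a fixed number of rows, as in the quote.

```rust
/// One run of a row selection: read `len` rows or skip `len` rows.
/// (Hypothetical stand-in for a run-length selector, not the parquet crate's type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Read(usize),
    Skip(usize),
}

/// Convert a boolean mask into run-length encoded read/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let selected = mask[i];
        let start = i;
        // Advance to the end of the current run of identical bits.
        while i < mask.len() && mask[i] == selected {
            i += 1;
        }
        let len = i - start;
        runs.push(if selected { Run::Read(len) } else { Run::Skip(len) });
    }
    runs
}

/// Indices of fixed-size pages that contain at least one selected row.
/// Assumes `page_size > 0` (`chunks` panics on a chunk size of zero).
fn pages_with_selected_rows(mask: &[bool], page_size: usize) -> Vec<usize> {
    mask.chunks(page_size)
        .enumerate()
        .filter(|(_, rows)| rows.iter().any(|&bit| bit))
        .map(|(page, _)| page)
        .collect()
}
```

For the mask `100001` with a page size of 2, `mask_to_runs` yields `Read(1), Skip(4), Read(1)` and `pages_with_selected_rows` yields pages 0 and 2 (zero-based), so the middle page holds no selected rows even though the mask spans it.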
   
   The current code handles this case by falling back to a selector strategy 
when masks would straddle page boundaries:
   
   
https://github.com/apache/arrow-rs/blob/911331aafa13f5e230440cf5d02feb245985c64e/parquet/src/arrow/push_decoder/reader_builder/mod.rs#L681-L730
   
   **Describe the solution you'd like**
   I would like the parquet decoder to be able to use the mask evaluation 
strategy *AND* skip pages. 
   
   **Describe alternatives you've considered**
   
   One option (from @tustvold ) is to make the mask iteration page aware, so 
that we don't evaluate the predicate when we can rule out the entire page. 
Perhaps we can teach the bitmask iteration to be smarter in this case.
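
One way to picture page-aware mask iteration is sketched below. This is only an illustration under assumptions: `PageAction` and `plan_pages` are hypothetical names, not part of the parquet crate, and the per-page row counts are assumed to be known up front (for example from an offset index).

```rust
use std::ops::Range;

/// What a page-aware reader could do with one data page (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum PageAction {
    /// No selected rows: skip the page without fetching or decoding it.
    SkipPage,
    /// At least one selected row: decode the page and apply these mask rows.
    DecodeWithMask(Range<usize>),
}

/// Classify each page by inspecting only the mask bits that cover it.
/// `page_row_counts[i]` is the number of rows in page `i`.
fn plan_pages(mask: &[bool], page_row_counts: &[usize]) -> Vec<PageAction> {
    assert_eq!(
        page_row_counts.iter().sum::<usize>(),
        mask.len(),
        "page row counts must cover the mask exactly"
    );
    let mut start = 0;
    page_row_counts
        .iter()
        .map(|&rows| {
            let range = start..start + rows;
            start += rows;
            if mask[range.clone()].iter().any(|&bit| bit) {
                PageAction::DecodeWithMask(range)
            } else {
                PageAction::SkipPage
            }
        })
        .collect()
}
```

With the layout from the diagram (pages of 5, 11, and 7 rows, and only rows 0, 3, and 21 selected), this plan decodes Page 0 with mask rows `0..5`, skips Page 1 entirely, and decodes Page 2 with mask rows `16..23`.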
   
   **Additional context**
   Thoughts from @tustvold 
https://github.com/apache/arrow-rs/pull/8733#pullrequestreview-3407492068:
   
   > By definition the mask selection strategy requests rows that weren't part 
of the original selection, the problem is that this could result in requesting 
rows for pages that we know are irrelevant. In some cases this just results in 
wasted IO, however, when using prefetching IO systems (such as 
AsyncParquetReader) this results in errors. 
   >
   > I think a better solution would be to ensure we only construct MaskChunk 
that don't cross page boundaries. Ideally this would be done on a per-leaf 
column basis, but tbh I suspect just doing it globally would probably work just 
fine.
   >
   > Edit: If one was feeling fancy, one could ignore page boundaries where 
both pages were present in the original selection, although in practice I 
suspect this not to make a huge difference.
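
The suggestion above could look roughly like the following sketch, which computes the row ranges that a mask chunk may cover without crossing into a page that was not fetched. The function `mask_chunk_ranges` and its parameters are hypothetical and are not the crate's `MaskChunk` API. The `merge_adjacent` flag models the "fancy" variant that ignores a boundary when both neighbouring pages were in the original selection.

```rust
use std::ops::Range;

/// Row ranges for which a mask chunk may be built.
///
/// `page_row_counts[i]` is the row count of page `i`; `page_fetched[i]` says
/// whether page `i` was part of the original selection (and so is in memory).
/// Rows of unfetched pages are omitted; a reader would skip them instead.
/// With `merge_adjacent`, consecutive fetched pages share one range.
fn mask_chunk_ranges(
    page_row_counts: &[usize],
    page_fetched: &[bool],
    merge_adjacent: bool,
) -> Vec<Range<usize>> {
    assert_eq!(page_row_counts.len(), page_fetched.len());
    let mut chunks: Vec<Range<usize>> = Vec::new();
    let mut start = 0;
    let mut prev_fetched = false;
    for (&rows, &fetched) in page_row_counts.iter().zip(page_fetched) {
        let end = start + rows;
        if fetched {
            if merge_adjacent && prev_fetched {
                // The previous page was fetched too, so extend its range.
                if let Some(last) = chunks.last_mut() {
                    last.end = end;
                }
            } else {
                chunks.push(start..end);
            }
        }
        prev_fetched = fetched;
        start = end;
    }
    chunks
}
```

For three pages of 2 rows each where only the first two were fetched, the strict variant gives `0..2` and `2..4`, while the merging variant gives the single range `0..4`. In both cases no range touches the unfetched third page.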
   
   

