[I] Pushdown filters to in-memory row group fetches [arrow-rs]

via GitHub Thu, 27 Mar 2025 11:07:39 -0700


ethe opened a new issue, #7348:
URL: https://github.com/apache/arrow-rs/issues/7348


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   So far, the Parquet Arrow reader provides two kinds of conditional 
retrieval/filtering:
   - Row selection: Offers select and skip methods based on row offsets, which 
can be pushed down to in-memory row group fetches.
   - Row filter: Only applies to record batches that have been read into memory 
and cannot be pushed down to in-memory row group fetches. Therefore, it cannot 
be used to skip fetching column chunks / pages that do not match the filter 
conditions.
   
   The Parquet format's statistics include min/max values, and the optionally 
enabled sparse index information can be used to accelerate random reads and 
avoid unnecessary disk fetches. However, the row selection mechanism only 
supports operations on row offsets. There is no API that lets users declare 
filter conditions to be pushed down into the fetch behavior, and skipping 
column chunks / pages that do not match the filter conditions based on the 
index has not been implemented.
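
   To make the gap concrete, here is a minimal sketch of how per-page min/max 
statistics could be turned into select/skip runs over row offsets. It does not 
use the arrow-rs API: `PageStats`, `RowRun`, and `runs_for_range` are 
hypothetical names, and the column is assumed to hold `i64` values.

```rust
/// Hypothetical per-page statistics (an assumption for illustration,
/// loosely modeled on min/max values kept in an index).
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: i64,
    max: i64,
    row_count: usize,
}

/// A run of consecutive rows to read or to skip (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowRun {
    Select(usize),
    Skip(usize),
}

/// Builds select/skip runs for the inclusive value range `lo..=hi`.
/// A page is selected when its [min, max] interval overlaps the range,
/// so selected pages may still contain non-matching rows.
fn runs_for_range(pages: &[PageStats], lo: i64, hi: i64) -> Vec<RowRun> {
    let mut runs: Vec<RowRun> = Vec::new();
    for page in pages {
        let may_match = page.max >= lo && page.min <= hi;
        let next = if may_match {
            RowRun::Select(page.row_count)
        } else {
            RowRun::Skip(page.row_count)
        };
        // Merge with the previous run when both are of the same kind.
        let merged = match (runs.last().copied(), next) {
            (Some(RowRun::Select(a)), RowRun::Select(b)) => Some(RowRun::Select(a + b)),
            (Some(RowRun::Skip(a)), RowRun::Skip(b)) => Some(RowRun::Skip(a + b)),
            _ => None,
        };
        match merged {
            Some(run) => {
                if let Some(last) = runs.last_mut() {
                    *last = run;
                }
            }
            None => runs.push(next),
        }
    }
    runs
}
```

   Because the statistics only bound the values in a page, the result is a 
superset of the matching rows; an exact filter would still have to run on the 
rows that were selected.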
   
   **Describe the solution you'd like**
   Add a third kind of method in addition to selection and filter. This new 
method would let users specify an exact match or a range of values for a 
column, and would use the index during in-memory row group fetches. This would 
avoid reading column chunks / pages that do not meet the filter conditions and 
improve random read efficiency.
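
   One possible shape for such a declaration is sketched below. This is an 
illustration only, not a proposed arrow-rs API: `ColumnPredicate`, 
`PageEntry`, and `byte_ranges_to_fetch` are hypothetical, and the column is 
assumed to hold `i64` values with per-page min/max and byte locations 
available.

```rust
use std::ops::Range;

/// Hypothetical predicate a user could declare on one column:
/// an exact match or an inclusive range of values.
#[derive(Debug, Clone, Copy)]
enum ColumnPredicate {
    Eq(i64),
    Between(i64, i64),
}

impl ColumnPredicate {
    /// Returns true if a page whose values lie in [min, max] could
    /// contain a matching value.
    fn may_match(&self, min: i64, max: i64) -> bool {
        match *self {
            ColumnPredicate::Eq(v) => min <= v && v <= max,
            ColumnPredicate::Between(lo, hi) => lo <= max && min <= hi,
        }
    }
}

/// Hypothetical index entry: page statistics plus the page's location.
#[derive(Debug, Clone, Copy)]
struct PageEntry {
    min: i64,
    max: i64,
    offset: u64,
    len: u64,
}

/// Returns the byte ranges of the pages that may satisfy the predicate,
/// so that all other pages are never fetched.
fn byte_ranges_to_fetch(pages: &[PageEntry], pred: ColumnPredicate) -> Vec<Range<u64>> {
    pages
        .iter()
        .filter(|p| pred.may_match(p.min, p.max))
        .map(|p| p.offset..p.offset + p.len)
        .collect()
}
```

   With a predicate of this kind, the fetch step could request only the 
returned byte ranges instead of the whole column chunk.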
   
   
   


