alamb opened a new issue, #8000: URL: https://github.com/apache/arrow-rs/issues/8000
This is my attempt to summarize my proposal / plan for improving the Parquet reader in arrow-rs.

## Summary

- [ ] Merge predicate cache for async reader: https://github.com/apache/arrow-rs/pull/7850
- [ ] Work on a better row selection representation: https://github.com/apache/arrow-rs/issues/5523
- [ ] Create a "push decoder" API for fine-grained control of IO and CPU work (and predicate caching): https://github.com/apache/arrow-rs/issues/7983
- [ ] Refactor the sync and async readers to use the push decoder

## Background

The current arrow-rs reader is great, but it has two major areas that I think should be improved:

1. Speed / efficiency when evaluating pushdown predicates ([`RowFilter`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html), aka late materialization; see https://github.com/apache/datafusion/issues/3463 in DataFusion): https://github.com/apache/arrow-rs/issues/7456
2. Flexibility in how the reader intermixes IO and CPU work, so that users can better control the interleaving of IO and CPU work *during* decode.

To fix pushdown predicates we need two things:

1. A better representation for when predicate results are not clustered together (I think this is just a software engineering exercise; see https://github.com/apache/arrow-rs/issues/5523)
2. Avoiding decoding the same column twice when it is needed for both a row filter and the output (this is more complex, as it requires more resources and thus a tradeoff)

## Hard-Coded Tradeoffs between IO, CPU and Memory

Avoiding double decoding requires buffering the results of predicate evaluation, which is exactly what @XiangpengHao has done in https://github.com/apache/arrow-rs/pull/7850. However, the approach in that PR requires buffering *all values* for predicate columns within a row group.
Furthermore, we only added caching for the async reader ([`ParquetRecordBatchStream`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html)) because the sync reader ([`ParquetRecordBatchReader`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ParquetRecordBatchReader.html)) evaluates filters across *all row groups* before decoding any data, which means the cache would likely require too much memory to be practical in most cases.

After study, I don't think there is any way to reduce the cache memory required by either the async or the sync reader without changing the IO patterns. Specifically, the current async reader evaluates a filter across all rows in a row group before using a **single subsequent IO** to read the data for all rows that will be output. Optimizing for a single subsequent IO means selective filters can reduce IO for later columns significantly, but it also means that the cache must hold the filter results for all the rows in the row group, or the decoder must decode the column twice (once for the filter and once for the output), as it does today.

In addition to the new caching, I think users may wish to make different tradeoffs between IO and memory usage:

1. More IOs and less internal buffering (e.g. when reading from a local SSD or an in-memory cache)
2. Fewer IOs and more internal buffering (e.g. when reading directly from a remote object store)

To reduce memory requirements for the cached reader, we will need to change the IO patterns. For the sync reader it may also be desirable to avoid decoding the column for the entire row group, and instead decode the column in smaller chunks. However, in some cases this would result in more individual IO operations, which could be slower depending on the storage system.
To that end, I am more convinced than ever that it is time to create a "push decoder" that permits users to control the IO and CPU work separately (see https://github.com/apache/arrow-rs/issues/7983).

Requests for reviewers:

1. Do you think the proposal makes sense?
2. Are you willing to help implement it?