alamb opened a new issue, #8000: URL: https://github.com/apache/arrow-rs/issues/8000
This is my attempt to summarize my proposal / plan for improving the Parquet reader in arrow-rs.

## Summary

- [ ] Merge predicate cache for async reader: https://github.com/apache/arrow-rs/pull/7850
- [ ] Work on a better row selection representation: https://github.com/apache/arrow-rs/issues/5523
- [ ] Create a "push decoder" API for fine-grained control of IO and CPU work (and predicate caching): https://github.com/apache/arrow-rs/issues/7983
- [ ] Refactor the sync and async readers to use the push decoder

## Background

The current arrow-rs reader is great, but it has two major areas that I think should be improved:

1. Speed / efficiency when evaluating pushdown predicates ([`RowFilter`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html), aka late materialization; see https://github.com/apache/datafusion/issues/3463 in DataFusion): https://github.com/apache/arrow-rs/issues/7456
2. Flexibility in how the reader intermixes IO and CPU work, so that users can better control the interleaving of IO and CPU work *during* decode.

To fix pushdown predicates we need two things:

1. A better representation for when predicate results are not clustered together (I think this is just a software engineering exercise; see https://github.com/apache/arrow-rs/issues/5523)
2. Avoiding decoding the same column twice when it is needed for both a row filter and the output (this is more complex, as it requires more resources and thus a tradeoff)

## Hard-Coded Tradeoffs between IO, CPU and Memory

Avoiding double decoding requires buffering the results of predicate evaluation, which is exactly what @XiangpengHao has done in https://github.com/apache/arrow-rs/pull/7850. However, the approach in that PR requires buffering *all values* for predicate columns within a row group.
Furthermore, we only added caching for the async reader ([`ParquetRecordBatchStream`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html)) because the sync reader ([`ParquetRecordBatchReader`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ParquetRecordBatchReader.html)) evaluates filters across *all row groups* before decoding any data, which means the cache would likely require too much memory to be practical in most cases.

After study, I don't think there is any way to reduce the cache memory required by either the async or the sync reader without changing the IO patterns. Specifically, the current async reader evaluates a filter across all rows in a row group before using a **single subsequent IO** to read the data for all rows that will be output. Optimizing for a single subsequent IO means selective filters can reduce IO for later columns significantly, but it also means that the cache must hold the filter results for all the rows in the row group, or the decoder must decode the column twice (once for the filter and once for the output), as it does today.

In addition to the new caching, I think users may wish to make different tradeoffs between IO and memory usage:

1. More IOs and less internal buffering (e.g. when reading from a local SSD or an in-memory cache)
2. Fewer IOs and more internal buffering (e.g. when reading directly from a remote object store)

To reduce memory requirements for the cached reader, we will need to change the IO patterns. For the sync reader it may also be desirable to avoid decoding the column for the entire row group, and instead decode the column in smaller chunks. However, in some cases this would result in more individual IO operations, which could be slower depending on the storage system.
To that end, I am more convinced than ever that it is time to create a "push decoder" that permits users to control the IO and CPU work separately (see https://github.com/apache/arrow-rs/issues/7983).

Requests for reviewers:

1. Do you think the proposal makes sense?
2. Are you willing to help implement it?