Rachelint opened a new issue, #5141: URL: https://github.com/apache/arrow-rs/issues/5141
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

I found that decompression costs a lot of CPU when using [horaedb](https://github.com/CeresDB/horaedb) in production, so I decided to refactor our parquet memory cache to cache the decompressed pages rather than only the raw bytes. However, I found that I can only do this with the low-level APIs and cannot reuse a lot of the existing code. I also found that [greptimedb](https://github.com/GreptimeTeam/greptimedb) has a similar need for such a cache, and had to copy a lot of code from parquet to implement it in [#2688](https://github.com/GreptimeTeam/greptimedb/pull/2688).

I think maybe we can make the row group reading process an interface that supports customization, so that the rest of the code can be reused.

**Describe the solution you'd like**

As I see it, the reading process for one row group can be summarized as follows:
+ calculate the page ranges in the row group.
+ fetch the pages according to the ranges above (compressed ranges).
+ decompress the pages.
+ decode the pages.

Maybe we can define a trait like the one below. It interacts with the other parts like this:
+ calculate the page ranges and pass them to `get_row_group`, which fetches the compressed ranges and returns the in-memory row group (which implements the `RowGroups` trait).
+ call `column_chunks` on the in-memory row group to generate the decompressed page iterator, the same as in the original process.

Users can then provide customized `AsyncRowGroupReader` and `RowGroups` impls to reach their goals (such as the decompressed page cache mentioned above).
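To make the intended cache placement concrete, here is a minimal, std-only sketch of the four steps with a decompressed page cache between "fetch" and "decompress". Everything in it is a hypothetical illustration and not part of the parquet crate: `PageKey`, `DecompressedPageCache`, `read_row_group`, the `file_id` parameter, and especially `toy_decompress`, which stands in for a real codec only so that the example runs on its own.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical cache key: which file, and which byte range of it holds the page.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PageKey {
    file_id: u64,
    start: usize,
    end: usize,
}

/// Toy stand-in for a real codec: every input byte is emitted twice.
/// It exists only so the sketch runs without the parquet crate.
fn toy_decompress(compressed: &[u8]) -> Vec<u8> {
    compressed.iter().flat_map(|b| [*b, *b]).collect()
}

/// Cache of decompressed pages, with a counter showing how often the codec runs.
#[derive(Default)]
struct DecompressedPageCache {
    pages: HashMap<PageKey, Vec<u8>>,
    decompress_calls: usize,
}

impl DecompressedPageCache {
    /// Steps 2 and 3: fetch the compressed range and decompress it,
    /// unless the decompressed page is already cached.
    fn get_or_decompress(&mut self, file_id: u64, file: &[u8], range: Range<usize>) -> &[u8] {
        let key = PageKey { file_id, start: range.start, end: range.end };
        if !self.pages.contains_key(&key) {
            let compressed = &file[range]; // "fetch" from the in-memory file
            self.decompress_calls += 1;
            self.pages.insert(key.clone(), toy_decompress(compressed));
        }
        &self.pages[&key]
    }
}

/// Reads every page of one row group through the cache and returns the number
/// of decompressed bytes that step 4 (decoding) would consume.
/// Step 1 (computing `page_ranges` from metadata) is assumed to have happened.
fn read_row_group(
    cache: &mut DecompressedPageCache,
    file_id: u64,
    file: &[u8],
    page_ranges: &[Range<usize>],
) -> usize {
    let mut decoded_bytes = 0;
    for range in page_ranges {
        let page = cache.get_or_decompress(file_id, file, range.clone());
        decoded_bytes += page.len();
    }
    decoded_bytes
}
```

Reading the same row group twice through one `DecompressedPageCache` runs the codec only once per page. The trait proposed below is the hook that would let a cache like this be plugged into the existing reader without copying its code.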
```
pub trait AsyncRowGroupReader {
    /// The in-memory row group returned by this reader.
    type R: RowGroups;

    /// Fetches the given (compressed) page ranges of one row group from
    /// `input` and returns the in-memory row group.
    async fn get_row_group<T: AsyncFileReader + Send>(
        &mut self,
        input: &mut T,
        row_group_idx: usize,
        row_group_offsets: RowGroupRanges,
    ) -> Result<Self::R>;
}
```

**Describe alternatives you've considered**

**Additional context**

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
