[I] Support customizing row group reading process in async reader [arrow-rs]

via GitHub Tue, 28 Nov 2023 19:06:30 -0800


Rachelint opened a new issue, #5141:
URL: https://github.com/apache/arrow-rs/issues/5141


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I found that decompression consumes a lot of CPU when running 
[horaedb](https://github.com/CeresDB/horaedb) in production, so I decided to 
refactor our parquet memory cache to cache the decompressed pages rather than 
only the raw bytes. However, I found that this can only be done with the 
low-level APIs, which makes it impossible to reuse a large amount of existing code. 
   
   I also found that [greptimedb](https://github.com/GreptimeTeam/greptimedb) 
has a similar need for such a cache, and had to copy a lot of code from parquet 
to implement it in [#2688](https://github.com/GreptimeTeam/greptimedb/pull/2688).
   
   I think we could turn the row group reading process into an interface that 
supports customization, so that the rest of the code can be reused.
   
   
   **Describe the solution you'd like**
   As I see it, the reading process for one row group can be summarized as 
follows:
   + calculate the page ranges in the row group.
   + fetch the pages according to those ranges (compressed ranges).
   + decompress the pages.
   + decode the pages.
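
The four steps above can be sketched with plain standard-library Rust. This is only an illustration, not the parquet crate's code: `PageLocation`, the function names, the run-length "codec", and the byte-to-`i32` decoding are all hypothetical stand-ins chosen so the example is self-contained.

```rust
use std::ops::Range;

/// Hypothetical, simplified page metadata: where a compressed page sits in the file.
#[derive(Debug, Clone, PartialEq)]
pub struct PageLocation {
    pub offset: usize,
    pub compressed_len: usize,
}

/// Step 1: calculate the byte ranges of the pages in the row group.
pub fn page_ranges(pages: &[PageLocation]) -> Vec<Range<usize>> {
    pages
        .iter()
        .map(|p| p.offset..p.offset + p.compressed_len)
        .collect()
}

/// Step 2: fetch the compressed bytes for each range
/// (here from an in-memory slice standing in for the file).
pub fn fetch_pages(file: &[u8], ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
    ranges.iter().map(|r| file[r.clone()].to_vec()).collect()
}

/// Step 3: decompress a page. Stand-in codec: (count, byte) run-length pairs.
pub fn decompress(page: &[u8]) -> Vec<u8> {
    page.chunks_exact(2)
        .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}

/// Step 4: decode a decompressed page into values.
/// Stand-in encoding: every byte is one value, widened to i32.
pub fn decode(page: &[u8]) -> Vec<i32> {
    page.iter().map(|&b| i32::from(b)).collect()
}

/// All four steps chained together for one row group.
pub fn read_row_group(file: &[u8], pages: &[PageLocation]) -> Vec<i32> {
    let ranges = page_ranges(pages);
    fetch_pages(file, &ranges)
        .iter()
        .flat_map(|compressed| decode(&decompress(compressed)))
        .collect()
}
```

In this sketch, a cache of decompressed pages would sit between steps 2 and 3, which is the part that currently cannot be swapped out without the low-level APIs.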
   
   Maybe we can define a trait like the one below. It would interact with the 
other parts like this: 
   + calculate the page ranges and pass them to `get_row_group`, which fetches 
the compressed ranges and returns the in-memory row group (which implements the 
`RowGroups` trait).
   + call `column_chunks` on the in-memory row group to generate the decompressed 
page iterator, the same as in the original process.
   
   Users could then provide customized `AsyncRowGroupReader` and `RowGroups` 
implementations to reach their goals (such as the decompressed page cache 
mentioned above).
   
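   ```rust
   pub trait AsyncRowGroupReader {
       /// The in-memory row group returned to the rest of the reader.
       type R: RowGroups;
   
       /// Fetches the given ranges of one row group from `input`.
       async fn get_row_group<T: AsyncFileReader + Send>(
           &mut self,
           input: &mut T,
           row_group_idx: usize,
           // Page ranges calculated for this row group.
           row_group_offsets: RowGroupRanges,
       ) -> Result<Self::R>;
   }
   ```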
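
To illustrate how a decompressed page cache could plug into such an interface, here is a minimal sketch. It makes several assumptions: the trait is a simplified, synchronous stand-in for the proposed `AsyncRowGroupReader` (no parquet types, no `async`), and `RowGroupReader`, `CachingReader`, `PageKey`, and the `misses` counter are hypothetical names introduced only for this example.

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical cache key: (row group index, start offset of the page).
type PageKey = (usize, usize);

/// Simplified, synchronous stand-in for the proposed trait.
pub trait RowGroupReader {
    type R;

    fn get_row_group(
        &mut self,
        input: &[u8],
        row_group_idx: usize,
        ranges: &[Range<usize>],
    ) -> Self::R;
}

/// Reader that keeps decompressed pages in memory and reuses them.
pub struct CachingReader<F: Fn(&[u8]) -> Vec<u8>> {
    decompress: F,
    cache: HashMap<PageKey, Arc<Vec<u8>>>,
    /// Number of pages that had to be decompressed (cache misses).
    pub misses: usize,
}

impl<F: Fn(&[u8]) -> Vec<u8>> CachingReader<F> {
    pub fn new(decompress: F) -> Self {
        Self { decompress, cache: HashMap::new(), misses: 0 }
    }
}

impl<F: Fn(&[u8]) -> Vec<u8>> RowGroupReader for CachingReader<F> {
    /// The "row group" here is just the list of decompressed pages.
    type R = Vec<Arc<Vec<u8>>>;

    fn get_row_group(
        &mut self,
        input: &[u8],
        row_group_idx: usize,
        ranges: &[Range<usize>],
    ) -> Self::R {
        let mut pages = Vec::with_capacity(ranges.len());
        for r in ranges {
            let key = (row_group_idx, r.start);
            // Cache hit: reuse the decompressed page, no decompression cost.
            if let Some(hit) = self.cache.get(&key) {
                pages.push(Arc::clone(hit));
                continue;
            }
            // Cache miss: fetch the compressed bytes, decompress, and remember.
            self.misses += 1;
            let page = Arc::new((self.decompress)(&input[r.clone()]));
            self.cache.insert(key, Arc::clone(&page));
            pages.push(page);
        }
        pages
    }
}
```

Reading the same row group twice through one `CachingReader` decompresses each page only once; the second call returns the same `Arc` buffers. A real implementation would also need eviction and a key that identifies the file, which this sketch leaves out.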
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


