jp0317 commented on PR #36510: URL: https://github.com/apache/arrow/pull/36510#issuecomment-1627482654
> Thanks for making a PR! However, I am really in doubt about its effectiveness. As the buffered stream does not know the page boundary, it can only issue next read in its best effort. As the caller, we don't know the page size either to tune the buffer_size here. IMO, the best solution is to equip PageReader with page index if available, and plan the read range based on the offset index of each page. WDYT?

Thanks for your review! I agree with using the page offset index, but I feel it is a different topic.

The new `buffer_size` parameter in this PR lets users customize the buffer size per column chunk, whereas currently all column chunks have to share the single buffer size from `read_properties`. On the user side, they previously set one buffer size in `read_properties`; with this PR they can additionally choose a buffer size for a specific column chunk.

E.g., if the user wants to limit buffer memory to 64 MB when reading two chunks of 10 MB and 100 MB, they can now assign 10 MB to the smaller chunk and the remaining 54 MB to the larger one.
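To make the 64 MB example concrete, here is a minimal sketch of how a caller might split a total buffer budget across column chunks before passing each value as the per-chunk `buffer_size`. The helper `split_buffer_budget` and its smallest-first policy are hypothetical and only for illustration; they are not part of Arrow or of this PR.

```python
MB = 1024 * 1024


def split_buffer_budget(chunk_sizes, total_budget):
    """Hypothetical helper: pick a buffer size per column chunk.

    Chunks are visited smallest first. Each one gets either its full size
    or an equal share of the budget still available, whichever is smaller,
    so small chunks are fully buffered and large chunks share the rest.
    """
    remaining = total_budget
    plan = {}
    ordered = sorted(chunk_sizes.items(), key=lambda item: item[1])
    for index, (name, size) in enumerate(ordered):
        chunks_left = len(ordered) - index
        fair_share = remaining // chunks_left
        plan[name] = min(size, fair_share)
        remaining -= plan[name]
    return plan


# The scenario from the comment: two chunks, 64 MB of buffer memory in total.
plan = split_buffer_budget({"small": 10 * MB, "large": 100 * MB}, 64 * MB)
for name, size in plan.items():
    print(f"{name}: {size // MB} MB")
```

With these inputs the smaller chunk is buffered in full (10 MB) and the larger chunk receives the remaining 54 MB, so the total stays within the 64 MB limit.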
