[GitHub] [arrow] jp0317 commented on pull request #36510: PARQUET-2321: [C++] allow customized buffer size when creating ArrowInputStream for a column PageReader

via GitHub Sat, 08 Jul 2023 13:24:06 -0700


jp0317 commented on PR #36510:
URL: https://github.com/apache/arrow/pull/36510#issuecomment-1627482654


   > Thanks for making a PR! However, I am really in doubt about its 
effectiveness. As the buffered stream does not know the page boundary, it can 
only issue next read in its best effort. As the caller, we don't know the page 
size either to tune the buffer_size here. IMO, the best solution is to equip 
PageReader with page index if available, and plan the read range based on the 
offset index of each page. WDYT?
   
   Thanks for your review! I agree that using the page offset index is a good 
idea, but I feel it is a separate topic. The new `buffer_size` parameter in this 
PR lets users customize the buffer size for individual column chunks, whereas 
currently all column chunks have to share the single buffer size from 
`read_properties`. On the user side, they previously set one buffer size in 
`read_properties`; with this PR they can additionally choose a buffer size for 
a specific column chunk. E.g., if the user wants to limit buffer memory to 64MB 
when reading two chunks of 10MB and 100MB, they can now assign 10MB to the 
smaller chunk and the remaining 54MB to the larger one.
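
   To make the budgeting in that example concrete, here is a rough sketch of 
how a caller might split a total buffer budget across column chunks before 
passing each value as `buffer_size`. The helper `AssignBufferSizes` is 
hypothetical: it is not part of this PR or of Arrow, and the smallest-first 
fair-share policy is only one possible choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): split `total_budget` bytes
// across column chunks. Chunks are visited from smallest to largest; each one
// gets at most its own size and at most an equal share of what is still
// unassigned, so budget that small chunks do not need flows to larger ones.
std::vector<int64_t> AssignBufferSizes(const std::vector<int64_t>& chunk_sizes,
                                       int64_t total_budget) {
  assert(total_budget >= 0);

  // Indices of the chunks, ordered by ascending chunk size.
  std::vector<std::size_t> order(chunk_sizes.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return chunk_sizes[a] < chunk_sizes[b];
  });

  std::vector<int64_t> buffer_sizes(chunk_sizes.size(), 0);
  int64_t remaining = total_budget;
  for (std::size_t i = 0; i < order.size(); ++i) {
    const int64_t chunks_left = static_cast<int64_t>(order.size() - i);
    const int64_t fair_share = remaining / chunks_left;
    const int64_t assigned = std::min(chunk_sizes[order[i]], fair_share);
    buffer_sizes[order[i]] = assigned;
    remaining -= assigned;
  }
  return buffer_sizes;
}
```

   With chunk sizes of 10MB and 100MB and a 64MB budget, this policy assigns 
10MB to the smaller chunk and 54MB to the larger one, matching the example 
above. The results are returned in the same order as the input chunks.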


