stevbear commented on issue #278: URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657436043
Hi @zeroshade, thanks for getting back to me. The OOM is not my concern at the moment, as I was able to get by with `BufferSize` and a reasonable batch size. My concern is efficiency. If I'm reading the source code correctly, the `BufferSize` inside `ReaderProperties` doesn't matter much, since a new buffer is created when reading the page data (https://github.com/apache/arrow-go/blob/main/parquet/file/page_reader.go#L543).

We have an unusual use case of using Parquet files for point reads: on each read we only fetch a few hundred rows, spanning one to three consecutive pages. Looking at the implementation, I will be able to use the column and page indexes to narrow down which pages I need. What I have observed in the current implementation is that each page requires one read for the page header and another read for the content (sometimes the second read is skipped, when reading the header already covers the content).

Since S3 has high latency, I was thinking it would be more efficient to support a mechanism for reading two or more pages in a single request, as long as they fit in memory. Thanks again!
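To illustrate the idea, here is a minimal sketch of the range coalescing I have in mind. It is not based on any existing arrow-go API; `pageRange` and `coalesce` are hypothetical names, and the page offsets/lengths are assumed to come from the page index. Nearby page ranges are merged into one larger read so that several consecutive pages cost a single high-latency request instead of one request each:

```go
package main

import "fmt"

// pageRange describes a page's byte range within the file
// (hypothetical helper type, not part of arrow-go).
type pageRange struct {
	Offset int64
	Length int64
}

// coalesce merges ranges whose gap is at most maxGap bytes into a
// single larger range. The input must be sorted by Offset. Each
// output range would then be fetched with one ranged request, and
// individual pages sliced out of the returned buffer.
func coalesce(pages []pageRange, maxGap int64) []pageRange {
	var out []pageRange
	for _, p := range pages {
		if n := len(out); n > 0 {
			last := &out[n-1]
			end := last.Offset + last.Length
			if p.Offset-end <= maxGap {
				// Extend the previous range to cover this page too.
				last.Length = p.Offset + p.Length - last.Offset
				continue
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// Two nearly adjacent pages and one distant page:
	// the first two merge into a single read.
	pages := []pageRange{{0, 100}, {110, 50}, {4096, 200}}
	fmt.Println(coalesce(pages, 64)) // [{0 160} {4096 200}]
}
```

The trade-off is reading a few wasted bytes in the gaps between pages in exchange for fewer round trips, which should be a clear win against S3 latency.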
