zeroshade commented on issue #278:
URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657493933

   The `BufferSize` matters because it controls the underlying `BufferedReader` 
that is being read from (see 
https://github.com/apache/arrow-go/blob/main/internal/utils/buf_reader.go#L72). 
When the page reader attempts to read the page header and data from S3, the 
underlying `BufferedReader` first reads `BufferSize` bytes from S3 to fill its 
buffer, and the buffers that are then created/allocated are merely copies out 
of that internal buffer.
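   For reference, here's a minimal sketch of wiring those two options up when 
opening a file (the `github.com/apache/arrow-go/v18` import path and the 
`s3File` argument are assumptions; substitute whatever `io.ReaderAt`/`io.Seeker` 
you already wrap around S3):

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

func openReader(s3File parquet.ReaderAtSeeker) (*file.Reader, error) {
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	// Each buffered read pulls this many bytes from S3; subsequent page
	// header/data reads are served as copies out of that internal buffer.
	props.BufferSize = 4 << 20 // e.g. 4 MiB, enough to span several pages
	// Stream each column chunk through the BufferedReader in BufferSize
	// chunks instead of reading the entire column chunk into memory at once.
	props.BufferedStreamEnabled = true

	return file.NewParquetReader(s3File, file.WithReadProps(props))
}
```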
   
   > In the current implementation, what I have observed is that, for those 
pages, it will require 1 read for the page header, and another read for the 
content (sometimes this doesn't happen if reading the page header covers the 
content). I was thinking that, since S3 has high latency, it would be more 
efficient to support a mechanism of reading 2 or more pages at the same time, 
given that memory will fit.
   
   Assuming the `BufferSize` is large enough and `BufferedStreamEnabled` is 
true, it should read multiple pages at once for you. Another option would be to 
add support for providing a read-cache that you could pre-populate using the 
OffsetIndex and PageLocations if you know which pages you want. That said, the 
current APIs don't make it easy (or really possible) to have the page reader 
skip entire pages or read specific pages and take advantage of such a 
read-cache, so in addition to the read-cache I'd have to make changes that 
enable that.
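   Purely to illustrate the read-cache idea (this is not an existing arrow-go 
API, and it doesn't let the page reader skip pages): because the reader only 
needs an `io.ReaderAt`/`io.Seeker`, you could already wrap your S3 reader so 
that reads falling inside ranges you pre-fetched (e.g. the byte ranges listed 
in the OffsetIndex's PageLocations) are served from memory. The names below 
are hypothetical, just a sketch of the shape:

```go
package s3cache // hypothetical package, not part of arrow-go

import "github.com/apache/arrow-go/v18/parquet"

// cachingReaderAt is a sketch of a pre-populated read-cache: reads that fall
// entirely inside a pre-fetched range are served from memory, everything else
// falls through to the underlying S3-backed reader.
type cachingReaderAt struct {
	parquet.ReaderAtSeeker                  // the real S3-backed reader
	ranges map[int64][]byte // start offset -> pre-fetched bytes (e.g. one page each)
}

func (c *cachingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	for start, data := range c.ranges {
		if off >= start && off+int64(len(p)) <= start+int64(len(data)) {
			// Entire request is covered by a pre-fetched range: copy from memory.
			return copy(p, data[off-start:]), nil
		}
	}
	// Cache miss: go to S3.
	return c.ReaderAtSeeker.ReadAt(p, off)
}
```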
   
   Does that make sense? 

