zeroshade commented on issue #278: URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657493933
The `BufferSize` matters because it controls the underlying `BufferedReader` that is being read from (see https://github.com/apache/arrow-go/blob/main/internal/utils/buf_reader.go#L72). When the page reader attempts to read the page header and data from S3, the underlying `BufferedReader` first reads `BufferSize` bytes from S3 to fill its buffer, and the buffers that are created/allocated afterward merely copy from that internal buffer.

> In the current implementation, what I have observed is that, for those pages, it will require 1 read for the page header, and another read for the content (sometimes this doesn't happen if reading the page header covers the content). I was thinking that, since S3 has high latency, it would be more efficient to support a mechanism of reading 2 or more pages at the same time, given that memory will fit.

Assuming the `BufferSize` is large enough and `BufferedStreamEnabled` is true, it should already read multiple pages at once for you (see the sketch at the end of this comment).

Another option would be to support providing a read cache that you could pre-populate using the OffsetIndex and PageLocations if you know which pages you want. That said, right now the current APIs don't make it easy (or really possible) to have the page reader skip entire pages or read specific pages and take advantage of such a read cache, so in addition to the read cache I'd have to make changes to enable that.

Does that make sense?
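For concreteness, here's a rough sketch of the configuration I'm describing. The `openBuffered` helper and the 8 MiB value are just illustrative; `src` would be whatever S3-backed `io.ReaderAt`/`io.Seeker` wrapper you're already using:

```go
package main

import (
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/apache/arrow-go/v18/parquet"
	"github.com/apache/arrow-go/v18/parquet/file"
)

// openBuffered opens a parquet file with buffered streaming enabled so that
// consecutive page headers/pages are served from a single large S3 read.
func openBuffered(src parquet.ReaderAtSeeker) (*file.Reader, error) {
	props := parquet.NewReaderProperties(memory.DefaultAllocator)
	// With BufferedStreamEnabled, each column chunk is read through a
	// BufferedReader that fills BufferSize bytes per request.
	props.BufferedStreamEnabled = true
	props.BufferSize = 8 << 20 // 8 MiB: example value, tune to your page sizes

	return file.NewParquetReader(src, file.WithReadProps(props))
}
```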
