stevbear commented on issue #278: URL: https://github.com/apache/arrow-go/issues/278#issuecomment-2657436043
Hi @zeroshade, thanks for getting back to me. The OOM is not my concern at the moment, as I was able to get by with `BufferSize` and a reasonable batch size. My concern is efficiency. If I'm reading the source code correctly, the `BufferSize` inside `ReaderProperties` doesn't matter much, since a new buffer is created when reading the page data (https://github.com/apache/arrow-go/blob/main/parquet/file/page_reader.go#L543).

We have an unusual use case of using Parquet files for point reads: on each read we only fetch a few hundred rows, spanning one to three consecutive pages. Looking at the implementation, I will be able to use the column and page indexes to narrow down which pages I need. What I have observed in the current implementation is that each page requires one read for the page header and another read for the content (sometimes the second read is skipped, when reading the header already covers the content).

Since S3 has high latency, I was thinking it would be more efficient to support a mechanism for reading two or more pages in a single request, as long as they fit in memory. Thanks again!
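To illustrate the idea, here is a minimal sketch of the range coalescing I have in mind. It is not based on any existing arrow-go API; `pageRange` and `coalesce` are hypothetical names, and the page offsets/lengths are assumed to come from the page index. Nearby page ranges are merged into one larger read so that several consecutive pages cost a single high-latency request instead of one request each:

```go
package main

import "fmt"

// pageRange describes a page's byte range within the file
// (hypothetical helper type, not part of arrow-go).
type pageRange struct {
	Offset int64
	Length int64
}

// coalesce merges ranges whose gap is at most maxGap bytes into a
// single larger range. The input must be sorted by Offset. Each
// output range would then be fetched with one ranged request, and
// individual pages sliced out of the returned buffer.
func coalesce(pages []pageRange, maxGap int64) []pageRange {
	var out []pageRange
	for _, p := range pages {
		if n := len(out); n > 0 {
			last := &out[n-1]
			end := last.Offset + last.Length
			if p.Offset-end <= maxGap {
				// Extend the previous range to cover this page too.
				last.Length = p.Offset + p.Length - last.Offset
				continue
			}
		}
		out = append(out, p)
	}
	return out
}

func main() {
	// Two nearly adjacent pages and one distant page:
	// the first two merge into a single read.
	pages := []pageRange{{0, 100}, {110, 50}, {4096, 200}}
	fmt.Println(coalesce(pages, 64)) // [{0 160} {4096 200}]
}
```

The trade-off is reading a few wasted bytes in the gaps between pages in exchange for fewer round trips, which should be a clear win against S3 latency.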
