On Thu, Feb 6, 2020, 12:41 PM Antoine Pitrou <anto...@python.org> wrote:
>
> On 06/02/2020 at 19:37, Wes McKinney wrote:
> > On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >> On 06/02/2020 at 16:26, Wes McKinney wrote:
> >>>
> >>> This seems useful, too. It becomes a question of where do you want to
> >>> manage the cached memory segments, however you obtain them. I'm
> >>> arguing that we should not have much custom code in the Parquet
> >>> library to manage the prefetched segments (and providing the correct
> >>> buffer slice to each column reader when they need it), and instead
> >>> encapsulate this logic so it can be reused.
> >>
> >> I see, so RandomAccessFile would have some associative caching logic to
> >> find whether the exact requested range was cached and then return it to
> >> the caller? That sounds doable. How is lifetime handled then? Are
> >> cached buffers kept on the RandomAccessFile until they are requested, at
> >> which point their ownership is transferred to the caller?
> >>
> >
> > This seems like too much to try to build into RandomAccessFile. I would
> > suggest a class that wraps a random access file and manages cached
> > segments and their lifetimes through explicit APIs.
>
> So Parquet would expect to receive that class rather than
> RandomAccessFile? Or it would grow separate paths for it?
>

If the user opts in to coalesced prefetching, then the RowGroupReader would
instantiate the wrapper under the hood. Public APIs (aside from new APIs in
ReaderProperties for prefetching) would be unchanged.

>
> Regards
>
> Antoine.
>
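
To make the wrapper idea in this thread concrete, here is a minimal sketch in Python rather than Arrow C++. It is not Arrow's or Parquet's API: `CachedRangeReader`, `coalesce`, `prefetch`, `read_range`, `release_before`, and the `max_gap` parameter are all hypothetical names chosen for illustration. The sketch only shows the shape of the proposal: the wrapper owns the prefetched segments, serves slices to readers, and drops segments through an explicit call instead of relying on the underlying file object.

```python
import io
from typing import BinaryIO, List, Tuple

Range = Tuple[int, int]  # (offset, length), both in bytes


def coalesce(ranges: List[Range], max_gap: int) -> List[Range]:
    """Merge ranges separated by at most max_gap bytes into larger ranges."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= max_gap:
                # Close enough: extend the previous range to cover this one.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


class CachedRangeReader:
    """Hypothetical wrapper that owns prefetched segments of a seekable file."""

    def __init__(self, raw: BinaryIO, max_gap: int = 64) -> None:
        self._raw = raw
        self._max_gap = max_gap
        self._segments: List[Tuple[int, bytes]] = []  # (start offset, data)
        self.raw_reads = 0  # reads issued to the underlying file

    def _read_raw(self, offset: int, length: int) -> bytes:
        self._raw.seek(offset)
        self.raw_reads += 1
        return self._raw.read(length)

    def prefetch(self, ranges: List[Range]) -> None:
        """Read the coalesced ranges up front and keep them as segments."""
        for offset, length in coalesce(ranges, self._max_gap):
            self._segments.append((offset, self._read_raw(offset, length)))

    def read_range(self, offset: int, length: int) -> bytes:
        """Serve a range from a cached segment, else read the file directly."""
        for start, data in self._segments:
            if start <= offset and offset + length <= start + len(data):
                rel = offset - start
                return data[rel:rel + length]
        return self._read_raw(offset, length)

    def release_before(self, offset: int) -> None:
        """Explicit lifetime control: drop segments ending at or before offset."""
        self._segments = [
            (start, data)
            for start, data in self._segments
            if start + len(data) > offset
        ]


# Example: three column-chunk-like ranges, two of them close together.
reader = CachedRangeReader(io.BytesIO(bytes(range(256)) * 4), max_gap=16)
reader.prefetch([(0, 10), (20, 10), (500, 8)])
chunk = reader.read_range(20, 10)  # served from the first coalesced segment
```

In this arrangement the caller that opts in to prefetching constructs the wrapper, and code that only knows about plain range reads does not need to change, which mirrors the point above about keeping public APIs unchanged.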