On Thu, Feb 6, 2020, 12:42 PM Antoine Pitrou <anto...@python.org> wrote:
> On 06/02/2020 19:40, Antoine Pitrou wrote:
> >
> > On 06/02/2020 19:37, Wes McKinney wrote:
> >> On Thu, Feb 6, 2020, 12:12 PM Antoine Pitrou <anto...@python.org> wrote:
> >>
> >>> On 06/02/2020 16:26, Wes McKinney wrote:
> >>>>
> >>>> This seems useful, too. It becomes a question of where do you want to
> >>>> manage the cached memory segments, however you obtain them. I'm
> >>>> arguing that we should not have much custom code in the Parquet
> >>>> library to manage the prefetched segments (and providing the correct
> >>>> buffer slice to each column reader when they need it), and instead
> >>>> encapsulate this logic so it can be reused.
> >>>
> >>> I see, so RandomAccessFile would have some associative caching logic to
> >>> find whether the exact requested range was cached and then return it to
> >>> the caller? That sounds doable. How is lifetime handled then? Are
> >>> cached buffers kept on the RandomAccessFile until they are requested, at
> >>> which point their ownership is transferred to the caller?
> >>
> >> This seems like too much to try to build into RandomAccessFile. I would
> >> suggest a class that wraps a random access file and manages cached
> >> segments and their lifetimes through explicit APIs.
> >
> > So Parquet would expect to receive that class rather than
> > RandomAccessFile? Or would it grow separate paths for it?
>
> Actually, on a more high-level basis, is the goal to prefetch for
> sequential consumption of row groups?

Essentially yes. One "easy" optimization is to prefetch the entire
serialized row group. This is an evolution of that idea: we want to
prefetch only the needed parts of a row group in the minimum number of IO
calls (consider reading the first 10 columns from a file with 1000 columns
-- so we want to do one IO call instead of the 10 we do now).

> Regards
>
> Antoine.
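
To make the idea a bit more concrete, here is a minimal sketch (in Python,
with purely illustrative names -- CoalescingReadCache and its methods are
not an existing Arrow API) of the kind of wrapper class being discussed:
it takes the byte ranges of the column chunks the reader actually needs,
coalesces nearby ranges so they can be fetched in as few IO calls as
possible, and then serves exact sub-ranges back to each column reader from
the cached buffers.

    # Hypothetical sketch only -- names are illustrative, not an Arrow API.
    class CoalescingReadCache:
        """Wraps a seekable binary file and prefetches byte ranges in few IO calls."""

        def __init__(self, file_obj, hole_size_limit=8192):
            # Ranges separated by a gap of at most `hole_size_limit` bytes
            # are merged and fetched with a single read.
            self._file = file_obj
            self._hole_size_limit = hole_size_limit
            self._entries = []  # list of (offset, bytes), one per coalesced read

        def cache(self, ranges):
            """Prefetch the given (offset, length) ranges, coalescing nearby ones."""
            ranges = sorted(ranges)
            i = 0
            while i < len(ranges):
                begin = ranges[i][0]
                end = begin + ranges[i][1]
                j = i + 1
                # Extend the current read while the next range is close enough.
                while j < len(ranges) and ranges[j][0] - end <= self._hole_size_limit:
                    end = max(end, ranges[j][0] + ranges[j][1])
                    j += 1
                self._file.seek(begin)
                self._entries.append((begin, self._file.read(end - begin)))
                i = j

        def read(self, offset, length):
            """Return exactly `length` bytes at `offset`, served from the cache."""
            for begin, data in self._entries:
                if begin <= offset and offset + length <= begin + len(data):
                    start = offset - begin
                    return data[start:start + length]
            raise KeyError("requested range was not prefetched")

With something along these lines, the Parquet reader would compute the
(offset, length) pairs of the selected column chunks from the file
metadata, call cache() once, and each column reader would then call read()
for exactly the slice it needs -- so reading the first 10 columns of a
1000-column file can be satisfied by a single coalesced IO call, and the
prefetched buffers are owned by the cache object rather than by custom
code inside the Parquet library.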