Hi Claire,

I think it would be better to continue the discussion in a related Jira
ticket or even a PR.

Cheers,
Gabor

Claire McGinty <[email protected]> ezt írta (időpont: 2024. márc.
5., K, 14:09):

> Great, makes sense, Gabor!
>
> Perhaps this could even be implemented via an integer Configuration value
> for how many pages, or page bytes, to buffer at a time, so that users can
> balance I/O speed with memory usage. I'll try out a few approaches and aim
> to update this thread when I have something.
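>
> As a very rough sketch of what I mean (both property names below are
> placeholders I'm making up for illustration, not existing parquet-mr
> options):
>
>     import org.apache.hadoop.conf.Configuration;
>
>     class PageBufferConfigSketch {
>       public static void main(String[] args) {
>         Configuration conf = new Configuration();
>         // Placeholder name: materialize at most N pages per column chunk at a time.
>         conf.setInt("parquet.page.read.buffer.count", 8);
>         // Placeholder alternative: cap the buffered page bytes per chunk instead,
>         // trading more I/O calls for a smaller memory footprint.
>         conf.setLong("parquet.page.read.buffer.bytes", 4L * 1024 * 1024);
>       }
>     }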
>
> Best,
> Claire
>
>
>
> On Tue, Mar 5, 2024 at 2:55 AM Gábor Szádovszky <[email protected]> wrote:
>
> > Hi Claire,
> >
> > I think you read it correctly. Your proposal sounds good to me, but you
> > need to make it a separate way of reading instead of rewriting the current
> > behavior. The current implementation figures out the consecutive parts in
> > the file (multiple pages or even column chunks written after each other)
> > and reads them in one attempt. This way the I/O is faster. Meanwhile, your
> > concerns are also completely valid, so reading the pages lazily as they are
> > needed saves memory. It should be up to the API client to choose between
> > the two approaches.
> > Since we already have the interfaces that we can hide our logic behind
> > (PageReadStore/PageReader), probably the best way would be to introduce an
> > additional configuration that allows lazy reading behind the scenes.
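> >
> > As a rough illustration only (the property name below is invented; nothing
> > like it exists in parquet-mr today):
> >
> >     import org.apache.hadoop.conf.Configuration;
> >
> >     class LazyReadToggleSketch {
> >       public static void main(String[] args) {
> >         Configuration conf = new Configuration();
> >         // Invented property: when true, ParquetFileReader would hand back a
> >         // PageReadStore whose PageReaders pull pages from the stream on demand;
> >         // when false (the default), it keeps today's eager consecutive read.
> >         conf.setBoolean("parquet.read.pages.lazily", true);
> >       }
> >     }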
> >
> > Cheers,
> > Gabor
> >
> > Claire McGinty <[email protected]> ezt írta (időpont: 2024.
> márc.
> > 4., H, 21:04):
> >
> > > Hi all,
> > >
> > > I had a question about memory usage in ParquetFileReader, particularly in
> > > #readNextRowGroup
> > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> > > /#readNextFilteredRowGroup
> > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> > > From what I can tell, these methods will enumerate all column chunks in
> > > the row group, then for each chunk, fully read all pages in the chunk.
> > >
> > > I've been encountering memory issues performing heavy reads of Parquet
> > > data, particularly in use cases that require the colocation of multiple
> > > Parquet files on a single worker. In cases like these, a single worker may
> > > be reading dozens or hundreds of Parquet files, and trying to materialize
> > > row groups is causing OOMs, even with a tweaked row group size.
> > >
> > > I'm wondering if there's any way to avoid materializing the entire row
> > > group at once, and instead materialize pages on an as-needed basis (along
> > > with dictionary encoding etc. when we start on a new chunk). Looking
> > > through the ParquetFileReader code, a possible solution could be to
> > > re-implement pagesInChunk
> > > <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> > > as an Iterator<DataPage> rather than a List<DataPage>, and modify
> > > ColumnChunkPageReader to support a lazy Collection of data pages?
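> > >
> > > To make that a bit more concrete, here is a rough sketch of what the
> > > PageReader side could look like (the class name is made up; today's
> > > ColumnChunkPageReader wraps a fully pre-read list of pages):
> > >
> > >     import java.util.Iterator;
> > >     import org.apache.parquet.column.page.DataPage;
> > >     import org.apache.parquet.column.page.DictionaryPage;
> > >     import org.apache.parquet.column.page.PageReader;
> > >
> > >     // Sketch only: pages are pulled from the iterator (and read/decompressed)
> > >     // when readPage() is called, instead of being materialized up front.
> > >     class LazyPageReader implements PageReader {
> > >       private final DictionaryPage dictionaryPage; // still read eagerly per chunk
> > >       private final long totalValueCount;
> > >       private final Iterator<DataPage> pages;      // e.g. a lazy pagesInChunk
> > >
> > >       LazyPageReader(DictionaryPage dictionaryPage, long totalValueCount,
> > >                      Iterator<DataPage> pages) {
> > >         this.dictionaryPage = dictionaryPage;
> > >         this.totalValueCount = totalValueCount;
> > >         this.pages = pages;
> > >       }
> > >
> > >       @Override
> > >       public DictionaryPage readDictionaryPage() { return dictionaryPage; }
> > >
> > >       @Override
> > >       public long getTotalValueCount() { return totalValueCount; }
> > >
> > >       @Override
> > >       public DataPage readPage() {
> > >         // Returns null once the chunk is exhausted.
> > >         return pages.hasNext() ? pages.next() : null;
> > >       }
> > >     }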
> > >
> > > Let me know what you think! It's possible that I'm misunderstanding how
> > > readNextRowGroup works -- Parquet internals are a steep learning curve :)
> > >
> > > Best,
> > > Claire
> > >
> >
>
