Hi Claire,

I think you read it correctly. Your proposal sounds good to me, but it should
be a separate way of reading instead of a rewrite of the current behavior. The
current implementation figures out the consecutive parts of the file (multiple
pages or even whole column chunks written one after another) and reads them in
one attempt, which makes the I/O faster. At the same time, your concerns are
completely valid: reading the pages lazily, as they are needed, saves memory.
It should be up to the API client to choose between the two solutions.
Since we already have interfaces that we can hide our logic behind
(PageReadStore/PageReader), the best way is probably to introduce an additional
configuration option that enables lazy reading behind the scenes.
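
To sketch what I mean (this is only an illustration; the class below and the
configuration key mentioned in the comments are placeholders, not existing
parquet-mr code), a lazy implementation could sit behind the same PageReader
interface and decode pages only when they are asked for:

import java.util.Iterator;
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.DictionaryPage;
import org.apache.parquet.column.page.PageReader;

// Hypothetical lazy counterpart of ColumnChunkPageReader. Instead of holding a
// fully materialized List<DataPage>, it pulls each page from an iterator that
// reads/decompresses the page only when readPage() is called.
class LazyColumnChunkPageReader implements PageReader {
  private final DictionaryPage dictionaryPage; // dictionary is still read eagerly per chunk
  private final Iterator<DataPage> pages;      // pages are materialized on demand
  private final long totalValueCount;

  LazyColumnChunkPageReader(DictionaryPage dictionaryPage,
                            Iterator<DataPage> pages,
                            long totalValueCount) {
    this.dictionaryPage = dictionaryPage;
    this.pages = pages;
    this.totalValueCount = totalValueCount;
  }

  @Override
  public DictionaryPage readDictionaryPage() {
    return dictionaryPage;
  }

  @Override
  public long getTotalValueCount() {
    return totalValueCount;
  }

  @Override
  public DataPage readPage() {
    // Same contract as the current eager reader: null once the chunk is exhausted.
    return pages.hasNext() ? pages.next() : null;
  }
}

// A new configuration flag (name to be decided; "parquet.page.read.lazy" is
// just a placeholder) could then decide in ParquetFileReader whether to build
// the current eager ColumnChunkPageReader or a lazy one like the above, so
// existing users keep the fast consecutive reads by default.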

Cheers,
Gabor

Claire McGinty <[email protected]> ezt írta (időpont: 2024. márc.
4., H, 21:04):

> Hi all,
>
> I had a question about memory usage in ParquetFileReader, particularly in
> #readNextRowGroup
> <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L937>
> /#readNextFilteredRowGroup
> <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1076>.
> From what I can tell, these methods will enumerate all column chunks in
> the row group, then for each chunk, fully read all pages in the chunk.
>
> I've been encountering memory issues performing heavy reads of Parquet
> data, particularly use cases that require the colocation of multiple
> Parquet files on a single worker. In cases like these, a single worker may
> be reading dozens or hundreds of Parquet files, and trying to materialize
> row groups is causing OOMs, even with a tweaked row group size.
>
> I'm wondering if there's any way to avoid materializing the entire row
> group at once, and instead materialize pages on an as-needed basis (along
> with dictionary encoding, etc., when we start on a new chunk). Looking through
> the ParquetFileReader code, a possible solution could be to re-implement
> pagesInChunk
> <https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.1/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1552>
> as an Iterator<DataPage> rather than a List<DataPage>, and modify
> ColumnChunkPageReader to support a lazy Collection of data pages?
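>
> Rough sketch of the kind of change I have in mind (just illustrative;
> PageSource and the names below are made up, not existing parquet-mr classes):
>
> import java.util.Iterator;
> import java.util.NoSuchElementException;
> import org.apache.parquet.column.page.DataPage;
>
> // Hypothetical stand-in for whatever reads and decompresses a single page
> // (header + body) from the column chunk's byte range.
> interface PageSource {
>   boolean hasMorePages();
>   DataPage readNextPage(); // reads and decodes exactly one page
> }
>
> // Instead of building the whole List<DataPage> up front, pagesInChunk could
> // return an iterator like this, so each page is decoded only on next().
> class LazyPageIterator implements Iterator<DataPage> {
>   private final PageSource source;
>
>   LazyPageIterator(PageSource source) {
>     this.source = source;
>   }
>
>   @Override
>   public boolean hasNext() {
>     return source.hasMorePages();
>   }
>
>   @Override
>   public DataPage next() {
>     if (!hasNext()) {
>       throw new NoSuchElementException();
>     }
>     return source.readNextPage(); // page is materialized only here
>   }
> }
>
> ColumnChunkPageReader could then consume this iterator instead of a pre-built
> list, so each page only stays on the heap while it's being read.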
>
> Let me know what you think! It's possible that I'm misunderstanding how
> readNextRowGroup works -- Parquet internals are a steep learning curve :)
>
> Best,
> Claire
>
