Thanks both for the quick response! I wonder if there is some code in
parquet cpp that might be keeping some cached information (perhaps
metadata) per file scanned?

On Wed, Sep 6, 2023 at 12:10 PM wish maple <maplewish...@gmail.com> wrote:

> I've run into lots of Parquet Dataset issues. The main problem is that we
> currently have 2 sets of APIs with different scan options. And sometimes
> different interfaces, like `to_batches()` and others, enable different
> scan options.
>
> I think [2] is similar to your problem. [1]-[4] are some issues I have hit before.
>
> As for the code, you may take a look at:
> 1. ParquetFileFormat and the Dataset-related code.
> 2. FileSystem and CacheRange. Parquet might use this to handle pre-buffering.
> 3. How the Parquet RowReader handles IO.
>
> [1] https://github.com/apache/arrow/issues/36765
> [2] https://github.com/apache/arrow/issues/37139
> [3] https://github.com/apache/arrow/issues/36587
> [4] https://github.com/apache/arrow/issues/37136
>
> On Wed, Sep 6, 2023 at 11:56 PM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hello,
> >
> > I have been testing "What is the max rss needed to scan through ~100G of
> > data in a Parquet dataset stored in GCS using Arrow C++".
> >
> > The current answer is about 6G of memory, which seems a bit high, so I
> > looked into it. What I observed during the process led me to think that
> > there are some potential cache/memory issues in the dataset/parquet cpp
> > code.
> >
> > Main observation:
> > (1) As I am scanning through the dataset, I printed out (a) the memory
> > allocated by the memory pool from ScanOptions and (b) the process rss. I
> > found that while (a) stays pretty stable throughout the scan (stays < 1G),
> > (b) keeps increasing during the scan (it looks linear in the number of
> > files scanned).
> > (2) I tested ScanNode in Arrow as well as an in-house library that
> > implements its own "S3Dataset" similar to Arrow dataset, both showing
> > similar rss usage (which led me to think the issue is more likely to be
> > in the parquet cpp code than in the dataset code).
> > (3) Scanning the same dataset twice in the same process doesn't increase
> > the max rss.
> >
> > I plan to look into the parquet cpp/dataset code but I wonder if someone
> > has some clues about what the issue might be or where to look?
> >
> > Thanks,
> > Li
> >
>
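
For anyone who wants to reproduce the comparison in observation (1), here is a rough sketch of how process rss could be logged next to a pool-accounted byte count. It is not the code used in the original test. It assumes Linux (it reads /proc/self/statm), and the helper names CurrentRssBytes and LogMemory are made up for illustration. The pool_bytes argument is a placeholder for whatever the memory pool reports; with Arrow that would presumably be the value of arrow::MemoryPool::bytes_allocated() on the pool passed to ScanOptions.

```cpp
#include <unistd.h>

#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Returns the resident set size of the current process in bytes, or -1 if
// /proc/self/statm cannot be read. Linux-only (assumption).
int64_t CurrentRssBytes() {
  std::ifstream statm("/proc/self/statm");
  int64_t total_pages = 0;
  int64_t resident_pages = 0;
  // The first two fields of statm are total program size and resident set
  // size, both measured in pages.
  if (!(statm >> total_pages >> resident_pages)) return -1;
  return resident_pages * static_cast<int64_t>(sysconf(_SC_PAGESIZE));
}

// Hypothetical helper: prints the pool-accounted bytes next to the process
// rss, plus the difference the pool does not account for.
void LogMemory(const std::string& label, int64_t pool_bytes) {
  const int64_t rss = CurrentRssBytes();
  std::cout << label << ": pool=" << pool_bytes << " rss=" << rss
            << " unaccounted=" << (rss - pool_bytes) << std::endl;
}
```

Calling LogMemory once per scanned file (or per batch) would show whether the "unaccounted" figure is the part that grows with the number of files.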
