Correction:

> I tried with both Antoine's suggestions (swapping the default allocator
> and calling ReleaseUnused) but neither seems to affect the max rss.
Calling ReleaseUnused does have some effect on the rss - the max rss goes
from ~6G -> ~5G - but there still seems to be something else.

On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:

> Also attaching my experiment code just in case:
> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>
> On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Reporting back with some new findings.
>>
>> Re Felipe and Antoine:
>> I tried both of Antoine's suggestions (swapping the default allocator
>> and calling ReleaseUnused), but neither seems to affect the max rss.
>> In addition, I managed to repro the issue by reading a list of n local
>> parquet files that all point to the same file, i.e., {"a.parquet",
>> "a.parquet", ...}. I am also able to crash my process by passing a
>> large enough n (I observed rss keep going up until the process
>> eventually gets killed). This observation led me to think there might
>> actually be some memory leak issue.
>>
>> Re Xuwei:
>> Thanks for the tips. I am gonna try this memory profiler next and see
>> what I can find.
>>
>> I am gonna keep looking into this, but again, any ideas / suggestions
>> are appreciated (and thanks for all the help so far!)
>>
>> Li
>>
>> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Thanks all for the additional suggestions. Will try them, but I want
>>> to answer Antoine's question first:
>>>
>>> > Which leads to the question: what is your OS?
>>>
>>> I am testing this on Debian 5.4.228 x86_64 GNU/Linux
>>>
>>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>>> wrote:
>>>
>>>> By the way, you can try to use a memory profiler like [1] and [2].
>>>> It would help to find out how the memory is used.
>>>>
>>>> Best,
>>>> Xuwei Fu
>>>>
>>>> [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>>>> [2] https://google.github.io/tcmalloc/gperftools.html
>>>>
>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote on Thu, Sep 7,
>>>> 2023 at 00:28:
>>>>
>>>> > > (a) stays pretty stable throughout the scan (stays < 1G), (b)
>>>> > > keeps increasing during the scan (looks linear to the number of
>>>> > > files scanned).
>>>> >
>>>> > I wouldn't take this to mean a memory leak, but rather the memory
>>>> > allocator not paging out virtual memory that has been allocated
>>>> > throughout the scan. Could you run your workload under a memory
>>>> > profiler?
>>>> >
>>>> > > (3) Scanning the same dataset twice in the same process doesn't
>>>> > > increase the max rss.
>>>> >
>>>> > Another sign this isn't a leak, just the allocator reaching a level
>>>> > of memory commitment that it doesn't feel like undoing.
>>>> >
>>>> > --
>>>> > Felipe
>>>> >
>>>> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>> >
>>>> > > Hello,
>>>> > >
>>>> > > I have been testing "What is the max rss needed to scan through
>>>> > > ~100G of parquet data stored in GCS using Arrow C++".
>>>> > >
>>>> > > The current answer is about ~6G of memory, which seems a bit high,
>>>> > > so I looked into it. What I observed during the process led me to
>>>> > > think that there are some potential cache/memory issues in the
>>>> > > dataset/parquet cpp code.
>>>> > >
>>>> > > Main observations:
>>>> > > (1) As I scan through the dataset, I printed out (a) memory
>>>> > > allocated by the memory pool from ScanOptions and (b) process rss.
>>>> > > I found that while (a) stays pretty stable throughout the scan
>>>> > > (stays < 1G), (b) keeps increasing during the scan (looks linear
>>>> > > to the number of files scanned).
>>>> > > (2) I tested the ScanNode in Arrow as well as an in-house library
>>>> > > that implements its own "S3Dataset" similar to Arrow dataset, and
>>>> > > both show similar rss usage. (This led me to think the issue is
>>>> > > more likely in the parquet cpp code than in the dataset code.)
>>>> > > (3) Scanning the same dataset twice in the same process doesn't
>>>> > > increase the max rss.
>>>> > >
>>>> > > I plan to look into the parquet cpp/dataset code, but I wonder if
>>>> > > someone has some clue what the issue might be or where to look?
>>>> > >
>>>> > > Thanks,
>>>> > > Li
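For anyone following this thread, the jemalloc heap-profiling route from link [1] above boils down to an environment-variable config plus the `jeprof` tool. A sketch, with the caveats that the binary name is hypothetical, the option values are illustrative, and profiling only works if the jemalloc in use was built with `--enable-prof`:

```shell
# Sketch: enable jemalloc heap profiling for an existing binary.
# prof:true turns profiling on, lg_prof_sample controls sampling rate,
# prof_final dumps a profile at process exit.
export MALLOC_CONF="prof:true,lg_prof_sample:19,prof_final:true"
./my_scan_binary            # hypothetical binary running the dataset scan

# Inspect the jeprof.*.heap file written at exit:
jeprof --show_bytes ./my_scan_binary jeprof.*.heap
```

This would show which call stacks retain the allocations behind the growing rss, which is exactly the leak-vs-allocator-commitment question debated above.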