Correction:

> I tried both of Antoine's suggestions (swapping the default allocator
> and calling ReleaseUnused) but neither seems to affect the max rss.

Calling ReleaseUnused does have some effect on the rss: the max rss goes
from ~6G to ~5G, but there still seems to be something else going on.
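
For reference, this is roughly what I am running (a minimal sketch, not
my full benchmark; it assumes the Arrow build has mimalloc enabled, and
the scan itself is elided):

  #include <arrow/memory_pool.h>
  #include <arrow/status.h>

  arrow::Status ScanWithExplicitPool() {
    // Swap the allocator (Arrow also honors the
    // ARROW_DEFAULT_MEMORY_POOL env var at process startup).
    arrow::MemoryPool* pool = nullptr;
    ARROW_RETURN_NOT_OK(arrow::mimalloc_memory_pool(&pool));

    // ... run the scan with ScanOptions::pool = pool ...

    // Hand unused (but still reserved) memory back to the OS; this is
    // the call that takes max rss from ~6G to ~5G for me.
    pool->ReleaseUnused();
    return arrow::Status::OK();
  }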

On Wed, Sep 6, 2023 at 4:35 PM Li Jin <ice.xell...@gmail.com> wrote:

> Also attaching my experiment code just in case:
> https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
>
> On Wed, Sep 6, 2023 at 4:29 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Reporting back with some new findings.
>>
>> Re Felipe and Antoine:
>> I tried both of Antoine's suggestions (swapping the default allocator
>> and calling ReleaseUnused) but neither seems to affect the max rss. In
>> addition, I managed to repro the issue by reading a list of n local
>> parquet files that all point to the same file, i.e., {"a.parquet",
>> "a.parquet", ... }. I am also able to crash my process by passing a
>> large enough n (I observed the rss keep going up until the process
>> eventually gets killed). This observation led me to think there might
>> actually be a memory leak.
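>>
>> A minimal sketch of the kind of repro I mean (not my exact code;
>> assumes a local file a.parquet exists and skips error handling via
>> ValueOrDie):
>>
>>   #include <arrow/dataset/api.h>
>>   #include <arrow/filesystem/localfs.h>
>>   #include <arrow/record_batch.h>
>>
>>   #include <memory>
>>   #include <string>
>>   #include <vector>
>>
>>   namespace ds = arrow::dataset;
>>
>>   int main() {
>>     auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
>>     auto format = std::make_shared<ds::ParquetFileFormat>();
>>     // n paths that all point at the same file.
>>     std::vector<std::string> paths(10000, "a.parquet");
>>     auto factory = ds::FileSystemDatasetFactory::Make(
>>         fs, paths, format, ds::FileSystemFactoryOptions{}).ValueOrDie();
>>     auto dataset = factory->Finish().ValueOrDie();
>>     auto scanner = dataset->NewScan().ValueOrDie()->Finish().ValueOrDie();
>>     auto reader = scanner->ToRecordBatchReader().ValueOrDie();
>>     // Drain the scan; rss keeps climbing with n even though each
>>     // batch is discarded immediately.
>>     std::shared_ptr<arrow::RecordBatch> batch;
>>     while (reader->ReadNext(&batch).ok() && batch != nullptr) {
>>       // Discard the batch.
>>     }
>>     return 0;
>>   }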
>>
>> Re Xuwei:
>> Thanks for the tips. I am going to try a memory profiler next and see
>> what I can find.
>>
>> I am going to keep looking into this, but again, any ideas/suggestions
>> are appreciated (and thanks for all the help so far!).
>>
>> Li
>>
>> On Wed, Sep 6, 2023 at 1:59 PM Li Jin <ice.xell...@gmail.com> wrote:
>>
>>> Thanks all for the additional suggestions. I will try them, but I want
>>> to answer Antoine's question first:
>>>
>>> > Which leads to the question: what is your OS?
>>>
>>> I am testing this on Debian (Linux kernel 5.4.228, x86_64 GNU/Linux).
>>>
>>> On Wed, Sep 6, 2023 at 1:31 PM wish maple <maplewish...@gmail.com>
>>> wrote:
>>>
>>>> By the way, you can try a memory profiler like [1] or [2].
>>>> It would help to find out how the memory is being used.
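>>>>
>>>> For example, gperftools lets you bracket the scan programmatically
>>>> (a sketch; assumes the binary links against tcmalloc, and the paths
>>>> are placeholders):
>>>>
>>>>   #include <gperftools/heap-profiler.h>
>>>>
>>>>   HeapProfilerStart("/tmp/scan");  // dumps /tmp/scan.0001.heap, ...
>>>>   // ... run the scan here ...
>>>>   HeapProfilerDump("after-scan");  // force a dump at this point
>>>>   HeapProfilerStop();
>>>>
>>>> Then inspect with: pprof --text ./your_binary /tmp/scan.0001.heap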
>>>>
>>>> Best,
>>>> Xuwei Fu
>>>>
>>>> [1]
>>>> https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
>>>> [2] https://google.github.io/tcmalloc/gperftools.html
>>>>
>>>>
>>>> Felipe Oliveira Carvalho <felipe...@gmail.com> wrote on Thu, Sep 7, 2023 at 00:28:
>>>>
>>>> > > (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
>>>> > > increasing during the scan (looks linear in the number of files
>>>> > > scanned).
>>>> >
>>>> > I wouldn't take this to mean a memory leak, but rather the memory
>>>> > allocator not paging out virtual memory that was allocated throughout
>>>> > the scan. Could you run your workload under a memory profiler?
>>>> >
>>>> > > (3) Scanning the same dataset twice in the same process doesn't
>>>> > > increase the max rss.
>>>> >
>>>> > Another sign this isn't a leak: just the allocator reaching a level
>>>> > of memory commitment that it doesn't feel like undoing.
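>>>> >
>>>> > One quick way to test that hypothesis (a sketch; glibc-specific, and
>>>> > only meaningful for memory that went through the system allocator):
>>>> >
>>>> >   #include <malloc.h>  // glibc
>>>> >
>>>> >   // After the scan, ask glibc to return free heap pages to the OS.
>>>> >   // If rss drops sharply here, the memory was being retained by the
>>>> >   // allocator rather than leaked.
>>>> >   malloc_trim(0);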
>>>> >
>>>> > --
>>>> > Felipe
>>>> >
>>>> > On Wed, Sep 6, 2023 at 12:56 PM Li Jin <ice.xell...@gmail.com> wrote:
>>>> >
>>>> > > Hello,
>>>> > >
>>>> > > I have been testing "What is the max rss needed to scan through
>>>> > > ~100G of Parquet data stored in GCS using Arrow C++".
>>>> > >
>>>> > > The current answer is ~6G of memory, which seems a bit high, so I
>>>> > > looked into it. What I observed during the process led me to think
>>>> > > that there are some potential cache/memory issues in the
>>>> > > dataset/parquet cpp code.
>>>> > >
>>>> > > Main observations:
>>>> > > (1) As I scan through the dataset, I print out (a) the memory
>>>> > > allocated by the memory pool from ScanOptions and (b) the process
>>>> > > rss. I found that while (a) stays pretty stable throughout the scan
>>>> > > (stays < 1G), (b) keeps increasing during the scan (looks linear in
>>>> > > the number of files scanned). (A sketch of how I track this is
>>>> > > below.)
>>>> > > (2) I tested ScanNode in Arrow as well as an in-house library that
>>>> > > implements its own "S3Dataset" similar to Arrow dataset, and both
>>>> > > show similar rss usage. (This led me to think the issue is more
>>>> > > likely in the parquet cpp code than in the dataset code.)
>>>> > > (3) Scanning the same dataset twice in the same process doesn't
>>>> > > increase the max rss.
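>>>> > >
>>>> > > (A sketch of how I track (a) and (b); the rss read is
>>>> > > Linux-specific and the helper names are my own:)
>>>> > >
>>>> > >   #include <arrow/memory_pool.h>
>>>> > >
>>>> > >   #include <fstream>
>>>> > >   #include <iostream>
>>>> > >   #include <string>
>>>> > >
>>>> > >   // (b) Process rss in kB, parsed from /proc/self/status.
>>>> > >   long CurrentRssKb() {
>>>> > >     std::ifstream status("/proc/self/status");
>>>> > >     std::string line;
>>>> > >     while (std::getline(status, line)) {
>>>> > >       if (line.rfind("VmRSS:", 0) == 0) {
>>>> > >         return std::stol(line.substr(6));
>>>> > >       }
>>>> > >     }
>>>> > >     return -1;
>>>> > >   }
>>>> > >
>>>> > >   // Called after each file scanned; pool is ScanOptions::pool.
>>>> > >   void ReportMemory(arrow::MemoryPool* pool) {
>>>> > >     std::cout << "pool=" << pool->bytes_allocated()
>>>> > >               << " rss_kb=" << CurrentRssKb() << std::endl;
>>>> > >   }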
>>>> > >
>>>> > > I plan to look into the parquet cpp/dataset code, but I wonder if
>>>> > > someone has clues about what the issue might be or where to look?
>>>> > >
>>>> > > Thanks,
>>>> > > Li
>>>> > >
>>>> >
>>>>
>>>
