Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Gang Wu
As suggested in other comments, I also highly recommend using a heap profiling tool to investigate what's going on there. BTW, 800 columns look suspicious to me. Could you try to test them without reading any batch? Not sure if the file metadata is the root cause. Or you may want to try another

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Correction: > I tried both of Antoine's suggestions (swapping the default allocator and calling ReleaseUnused) but neither seems to affect the max rss. Calling ReleaseUnused does have some effect on the rss - the max rss goes from ~6G -> 5G but still there seems to be something else. On Wed, Sep 6

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Also attaching my experiment code just in case: https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43 On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote: > Reporting back with some new findings. > > Re Felipe and Antoine: > I tried both of Antoine's suggestions (swapping the default all

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Reporting back with some new findings. Re Felipe and Antoine: I tried both of Antoine's suggestions (swapping the default allocator and calling ReleaseUnused) but neither seems to affect the max rss. In addition, I managed to repro the issue by reading a list of n local parquet files that point to th

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Thanks all for the additional suggestions. Will try it but want to answer Antoine's question first: > Which leads to the question: what is your OS? I am testing this on Debian 5.4.228 x86_64 GNU/Linux On Wed, Sep 6, 2023 at 1:31 PM wish maple wrote: > By the way, you can try to use a memory-pr

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
By the way, you can try to use a memory profiler like [1] and [2]. It would help to find how the memory is used. Best, Xuwei Fu [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling [2] https://google.github.io/tcmalloc/gperftools.html Felipe Oliveira Carvalho wrote on Sep 7, 2023,

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Felipe Oliveira Carvalho
> (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps increasing during the scan (looks linear to the number of files scanned). I wouldn't take this to mean a memory leak but the memory allocator not paging out virtual memory that has been allocated throughout the scan. Could you r
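
A minimal sketch of how the peak-versus-current RSS distinction raised in this reply could be checked on Linux. It assumes `getrusage` reports `ru_maxrss` in kilobytes and that `/proc/self/statm` is available. The helper names (`PeakRssKiB`, `CurrentRssKiB`, `PeakRssAfterTouching`) are hypothetical and are not part of Arrow:

```cpp
#include <sys/resource.h>
#include <unistd.h>

#include <cstddef>
#include <fstream>
#include <vector>

// Peak resident set size of this process, in KiB.
// Assumption: Linux, where getrusage() reports ru_maxrss in kilobytes.
long PeakRssKiB() {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
  return usage.ru_maxrss;
}

// Current resident set size in KiB, read from /proc/self/statm
// (the second field is the number of resident pages).
// Returns -1 if the file cannot be read.
long CurrentRssKiB() {
  std::ifstream statm("/proc/self/statm");
  long total_pages = 0;
  long resident_pages = 0;
  if (!(statm >> total_pages >> resident_pages)) return -1;
  return resident_pages * (sysconf(_SC_PAGESIZE) / 1024);
}

// Allocates `bytes`, writes to every page so it becomes resident, frees the
// buffer, and returns the peak RSS seen afterwards. The peak is a high-water
// mark: it stays elevated even though the buffer has already been freed.
long PeakRssAfterTouching(std::size_t bytes) {
  {
    std::vector<char> buffer(bytes);
    // Volatile writes keep the compiler from eliding the allocation.
    volatile char* data = buffer.data();
    for (std::size_t i = 0; i < bytes; i += 4096) data[i] = 1;
  }
  return PeakRssKiB();
}
```

Because peak RSS only ever grows, it cannot show whether memory was later handed back to the OS. Comparing `PeakRssKiB()` with `CurrentRssKiB()` once a scan has finished is one way to tell pages retained by the allocator apart from a working set that keeps growing.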

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
> Hi Jin, > Do you have more information about the parquet file? This is metadata for one file (I scanned about 2000 files in total) created_by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b) num_columns: 840 num_rows: 87382 num_row_groups: 1 format_ve

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
1. In dataset, there might be `fragment_readahead` or other options. 2. In Parquet, if prebuffer is enabled, it will prebuffer some columns (see `FileReaderImpl::GetRecordBatchReader`). 3. In Parquet, if non-buffered read is enabled, when reading a column, the whole ColumnChunk would be read. Otherwise, it w

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Thanks both for the quick response! I wonder if there is some code in parquet cpp that might be keeping some cached information (perhaps metadata) per file scanned? On Wed, Sep 6, 2023 at 12:10 PM wish maple wrote: > I've met lots of Parquet Dataset issues. The main problem is that currently >

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Antoine Pitrou
Hi Li, On 06/09/2023 at 17:55, Li Jin wrote: Hello, I have been testing "What is the max rss needed to scan through ~100G of data in a parquet stored in gcs using Arrow C++". The current answer is about ~6G of memory which seems a bit high so I looked into it. What I observed during the pr

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
I've met lots of Parquet Dataset issues. The main problem is that currently we have 2 sets of APIs and they have different scan options. And sometimes different interfaces like `to_batches()` or others would enable different scan options. I think [2] is similar to your problem. 1-4 are some issues

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Gang Wu
Hi Jin, Do you have more information about the parquet file? What came to my mind is this issue: https://github.com/apache/arrow/issues/35393 If you have observed something, please feel free to create a new issue and post what you have found there. Thanks, Gang On Wed, Sep 6, 2023 at 11:56 PM Li

[C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Li Jin
Hello, I have been testing "What is the max rss needed to scan through ~100G of data in a parquet stored in gcs using Arrow C++". The current answer is about ~6G of memory which seems a bit high so I looked into it. What I observed during the process led me to think that there are some potential

Arrow R Package Development Sync Call - Thursday 7th September

2023-09-06 Thread Nic Crane
The fortnightly Arrow R package dev community call is on Thursday 7th September at 16:30 UTC (12:30 ET). Joining instructions are below. Video call link: https://meet.google.com/dbm-ybmv-evb Phone numbers: https://tel.meet/dbm-ybmv-evb?pin=9199558189233 The meeting notes can be found here; pleas

Re: [ACCOUNCE] New Arrow Committer: Metehan Yildirim

2023-09-06 Thread Kevin Gurney
Congratulations Metehan! From: Mustafa Akur Sent: Wednesday, September 6, 2023 1:56 AM To: dev@arrow.apache.org Subject: Re: [ACCOUNCE] New Arrow Committer: Metehan Yildirim Congrats Mete! On Wed, Sep 6, 2023 at 7:19 AM Alenka Frim wrote: > Congratulations Met