As suggested in other comments, I also highly recommend using a
heap profiling tool to investigate what's going on there.
BTW, 800 columns looks suspicious to me. Could you try testing
without reading any batches? I'm not sure whether the file metadata is the
root cause. Or you may want to try another
Correction:
> I tried both of Antoine's suggestions (swapping the default allocator
and calling ReleaseUnused), but neither seems to affect the max rss.
Calling ReleaseUnused does have some effect on the rss - the max rss goes
from ~6G to ~5G, but there still seems to be something else.
On Wed, Sep 6
Also attaching my experiment code just in case:
https://gist.github.com/icexelloss/88195de046962e1d043c99d96e1b8b43
On Wed, Sep 6, 2023 at 4:29 PM Li Jin wrote:
> Reporting back with some new findings.
>
> Re Felipe and Antoine:
> I tried both of Antoine's suggestions (swapping the default allocator
Reporting back with some new findings.
Re Felipe and Antoine:
I tried both of Antoine's suggestions (swapping the default allocator and
calling ReleaseUnused), but neither seems to affect the max rss. In addition, I
managed to repro the issue by reading a list of n local parquet files that
point to th
Thanks all for the additional suggestions. Will try it but want to answer
Antoine's question first:
> Which leads to the question: what is your OS?
I am testing this on Debian 5.4.228 x86_64 GNU/Linux
On Wed, Sep 6, 2023 at 1:31 PM wish maple wrote:
> By the way, you can try to use a memory-pr
By the way, you can try to use a memory profiler like [1] or [2].
It would be helpful to find out how the memory is being used.
Best,
Xuwei Fu
[1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling
[2] https://google.github.io/tcmalloc/gperftools.html
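Following the jemalloc suggestion in [1], heap profiling is usually turned on through the environment. A sketch (the binary name `scan_bench` is hypothetical, and this assumes a stock jemalloc built with `--enable-prof`; if Arrow bundles its own prefixed jemalloc, the variable name may differ):

```shell
# Dump a heap profile roughly every 2^30 bytes (~1 GiB) of allocation.
MALLOC_CONF="prof:true,lg_prof_interval:30,prof_prefix:jeprof.out" ./scan_bench

# Inspect the dumps with jeprof to see which call sites retain memory.
jeprof --pdf ./scan_bench jeprof.out.*.heap > profile.pdf
```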
On Sep 7, 2023, Felipe Oliveira Carvalho wrote:
> (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps
increasing during the scan (looks linear to the number of files scanned).
I wouldn't take this to mean a memory leak, but rather the memory allocator
not paging out virtual memory that was allocated throughout the scan.
Could you r
> Hi Jin,
> Do you have more information about the parquet file?
This is metadata for one file (I scanned about 2000 files in total)
created_by: parquet-mr version 1.12.3 (build
f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
num_columns: 840
num_rows: 87382
num_row_groups: 1
format_ve
1. In Dataset, there might be `fragment_readahead` or other readahead options.
2. In Parquet, if prebuffer is enabled, it will prebuffer some column ( See
`FileReaderImpl::GetRecordBatchReader`)
3. In Parquet, if non-buffered read is enabled, when reading a column, the
whole ColumnChunk would be read.
Otherwise, it w
Thanks both for the quick response! I wonder if there is some code in
parquet cpp that might be keeping some cached information (perhaps
metadata) per file scanned?
On Wed, Sep 6, 2023 at 12:10 PM wish maple wrote:
> I've met lots of Parquet Dataset issues. The main problem is that currently
>
Hi Li,
Le 06/09/2023 à 17:55, Li Jin a écrit :
Hello,
I have been testing "What is the max rss needed to scan through ~100G of
data in a parquet stored in gcs using Arrow C++".
The current answer is about ~6G of memory which seems a bit high so I
looked into it. What I observed during the pr
I've met lots of Parquet Dataset issues. The main problem is that currently
we have two sets of APIs,
and they have different scan options. And sometimes different interfaces
like `to_batches()` or
others would enable different scan options.
I think [2] is similar to your problem. 1-4 are some issues
Hi Jin,
Do you have more information about the parquet file? What came to
my mind is this issue: https://github.com/apache/arrow/issues/35393
If you have observed something, please feel free to create a new issue
and post what you have found there.
Thanks,
Gang
On Wed, Sep 6, 2023 at 11:56 PM Li
Hello,
I have been testing "What is the max rss needed to scan through ~100G of
data in a parquet stored in gcs using Arrow C++".
The current answer is about ~6G of memory which seems a bit high so I
looked into it. What I observed during the process led me to think that
there are some potential
The fortnightly Arrow R package dev community call is on Thursday 7th
September at 16:30 UTC (12:30 ET).
Joining instructions are below.
Video call link: https://meet.google.com/dbm-ybmv-evb
Phone numbers: https://tel.meet/dbm-ybmv-evb?pin=9199558189233
The meeting notes can be found here; pleas
Congratulations Metehan!
From: Mustafa Akur
Sent: Wednesday, September 6, 2023 1:56 AM
To: dev@arrow.apache.org
Subject: Re: [ANNOUNCE] New Arrow Committer: Metehan Yildirim
Congrats Mete!
On Wed, Sep 6, 2023 at 7:19 AM Alenka Frim
wrote:
> Congratulations Met