[ https://issues.apache.org/jira/browse/ARROW-14987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454354#comment-17454354 ]

Weston Pace edited comment on ARROW-14987 at 12/7/21, 3:42 AM:
---------------------------------------------------------------

*TL;DR: A chunk_size of 3 is way too low.*

Thank you so much for the detailed reproduction.

h3. Some notes

First, I used 5 times the amount of data that you were working with (1,600,000 rows instead of 320,000).  This works out to about 12.5MB of int64_t "data".

Second, you are not releasing the variable named "table" in your main method.  
This holds on to 12.5MB of RAM.  I added table.reset() before the sleep to take 
care of this.

Third, a chunk size of 3 is pathologically small. This means parquet is going 
to have to write row group metadata after every 3 rows of data.  As a result, 
the parquet file, which only contains 12.5MB of real data, requires 169MB.  
This means there is ~157MB of metadata.  A chunk size should, at a minimum, be 
in the tens of thousands, and often is in the millions.

*When I run this test I end up with nearly 1GB of memory usage!  Even given the erroneously large parquet file, this seems like way too much.*

h3. Figuring out Arrow memory pool usage

One helpful tool when determining how much RAM Arrow is using is to print out 
how many bytes Arrow thinks it is holding onto.  To do this you can add...

{noformat}
std::cout << arrow::default_memory_pool()->bytes_allocated() << " bytes_allocated" << std::endl;
{noformat}

Assuming you add the "table.reset()" call, this should print "0 bytes_allocated", which means Arrow is not holding on to any memory.

The second common suspect is jemalloc.  Arrow uses jemalloc (or possibly mimalloc) internally in its memory pools, and these allocators sometimes over-allocate and sometimes hold onto memory for a little while.  However, that seems unlikely to be the cause here, because Arrow configures its jemalloc by default to release over-allocated memory every second.

To verify, I built an instrumented version of Arrow that prints stats for its internal jemalloc pool after 5 seconds of being idle.  I got:

{noformat}
Allocated: 29000, active: 45056, metadata: 6581448 (n_thp 0), resident: 6606848, mapped: 12627968, retained: 125259776
{noformat}

This means Arrow has 29KB of data actively allocated (this is curious, given 
bytes_allocated is 0, and worth investigation at a later date, but certainly 
not the culprit here).

That 29KB of active data spans 45.056KB of pages (this is what people refer to 
when they talk about fragmentation).  There is also 6.58MB of jemalloc 
metadata.  I'm pretty sure this is rather independent of the workload and not 
something to worry too much about.

Combined, this 45.056KB of data and 6.58MB of metadata is occupying 6.61MB of 
RSS.  So far so good.
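
For anyone who wants to reproduce those numbers, they are just jemalloc's stats mallctls.  A minimal sketch against a plain (unprefixed) jemalloc looks like the following; Arrow's bundled jemalloc is built with a symbol prefix, so my instrumented build goes through the prefixed equivalents:

{code:c++}
#include <jemalloc/jemalloc.h>

#include <cstddef>
#include <cstdint>
#include <cstdio>

void print_jemalloc_stats() {
  // Statistics are cached; bump the epoch to refresh them before reading.
  uint64_t epoch = 1;
  size_t epoch_len = sizeof(epoch);
  mallctl("epoch", &epoch, &epoch_len, &epoch, sizeof(epoch));

  const char* stats[] = {"stats.allocated", "stats.active", "stats.metadata",
                         "stats.resident",  "stats.mapped", "stats.retained"};
  for (const char* stat : stats) {
    size_t value = 0;
    size_t value_len = sizeof(value);
    mallctl(stat, &value, &value_len, nullptr, 0);
    std::printf("%s: %zu\n", stat, value);
  }
}
{code}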

h3. Figuring out the rest of the memory usage

There is only one other place the remaining memory can be: the application's global system allocator.  To debug this further, I built my test application with jemalloc as its global allocator (a different jemalloc instance than the one inside Arrow).  This means Arrow's memory pool uses one instance of jemalloc and everything else uses my own instance.  Printing its stats, I get:

{noformat}
Allocated: 257904, active: 569344, metadata: 15162288 (n_thp 0), resident: 950906880, mapped: 958836736, retained: 648630272
{noformat}

Now we have found our culprit.  There is about 258KB allocated, occupying 569KB worth of pages, plus 15MB of jemalloc metadata.  That much is pretty reasonable and makes sense (this is memory used by shared pointers and various metadata objects).

_However, this ~15MB of data is occupying nearly 1GB of RSS!_

To debug further, I used jemalloc's memory profiling to track where all of these allocations were happening.  It turns out most of them were in the parquet reader itself.  While the table being read will eventually be constructed in Arrow's memory pool, the parquet reader does not use the memory pool for the various allocations needed to operate the reader itself.
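
For the record, the profiling here is a stock jemalloc feature, not anything Arrow-specific.  Assuming a jemalloc built with --enable-prof and run with MALLOC_CONF=prof:true, a heap dump can be triggered from the application like this and then inspected with jeprof:

{code:c++}
#include <jemalloc/jemalloc.h>

// Writes a heap profile file (named jeprof.<pid>...heap by default) to the
// working directory.  Requires a profiling-enabled jemalloc started with
// MALLOC_CONF=prof:true; otherwise this mallctl simply returns an error.
void dump_heap_profile() {
  mallctl("prof.dump", nullptr, nullptr, nullptr, 0);
}
{code}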

So, putting this all together into a hypothesis...

The chunk size of 3 means we have a ton of metadata.  This metadata gets allocated by the parquet reader in lots of very small allocations.  These allocations fragment badly, and the system allocator ends up scattering them across a wide swath of RSS, resulting in a large amount of over-allocation.

h3. Fixes

h4. Fix 1: Use more jemalloc

Since my test was already using jemalloc, I can configure it the same way Arrow does: enable the background thread and set it to purge on a 1-second interval.  Now, running my test, after 5 seconds of inactivity I get the following from the global jemalloc:

{noformat}
Allocated: 246608, active: 544768, metadata: 15155760 (n_thp 0), resident: 15675392, mapped: 23613440, retained: 1382526976
{noformat}

That same ~15MB of data and jemalloc metadata is now spread across only 15.6MB of RSS (pretty great fragmentation behavior).  I can confirm this by looking at the RSS of the process, which reports 25MB (most of which is explained by the two jemalloc instances' metadata), a massive improvement over 1GB.

h4. Fix 2: Use a sane chunk size

If I change the chunk size to 100,000, then suddenly parquet is not making so many tiny allocations (my program also runs much faster) and I get the following stats for the global jemalloc instance:

{noformat}
Allocated: 1756168, active: 2027520, metadata: 4492600 (n_thp 0), resident: 6496256, mapped: 8318976, retained: 64557056
{noformat}

And I see only 18.5MB of RSS usage.
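
Concretely, in the reproduction below this is just the last argument to WriteTable in write_parquet_file; chunk_size is the number of rows per Parquet row group:

{code:c++}
// With a chunk_size of 3 the 1.6M-row test produces over 500,000 row groups;
// with 100,000 it produces 16.
PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
    table, arrow::default_memory_pool(), outfile, /*chunk_size=*/100000));
{code}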



> [C++]Memory leak while reading parquet file
> -------------------------------------------
>
>                 Key: ARROW-14987
>                 URL: https://issues.apache.org/jira/browse/ARROW-14987
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 6.0.1
>            Reporter: Qingxiang Chen
>            Priority: Major
>
> When I used parquet to access data, I found that the memory usage was still
> high after the function ended. I reproduced this problem in the example code
> shown below:
>  
> {code:c++}
> #include <arrow/api.h>
> #include <arrow/io/api.h>
> #include <parquet/arrow/reader.h>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
> #include <unistd.h>
> #include <iostream>
> std::shared_ptr<arrow::Table> generate_table() {
>   arrow::Int64Builder i64builder;
>   for (int i = 0; i < 320000; i++) {
>     i64builder.Append(i);
>   }
>   std::shared_ptr<arrow::Array> i64array;
>   PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));
>   std::shared_ptr<arrow::Schema> schema =
>       arrow::schema({arrow::field("int", arrow::int64())});
>   return arrow::Table::Make(schema, {i64array});
> }
> void write_parquet_file(const arrow::Table& table) {
>   std::shared_ptr<arrow::io::FileOutputStream> outfile;
>   PARQUET_ASSIGN_OR_THROW(
>       outfile, arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
> }
> void read_whole_file() {
>   std::cout << "Reading parquet-arrow-example.parquet at once" << std::endl;
>   std::shared_ptr<arrow::io::ReadableFile> infile;
>   PARQUET_ASSIGN_OR_THROW(
>       infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet",
>                                             arrow::default_memory_pool()));
>   std::unique_ptr<parquet::arrow::FileReader> reader;
>   PARQUET_THROW_NOT_OK(
>       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>   std::shared_ptr<arrow::Table> table;
>   PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>   std::cout << "Loaded " << table->num_rows() << " rows in " << table->num_columns()
>             << " columns." << std::endl;
> }
> int main(int argc, char** argv) {
>   std::shared_ptr<arrow::Table> table = generate_table();
>   write_parquet_file(*table);
>   std::cout << "start " << std::endl;
>   read_whole_file();
>   std::cout << "end " << std::endl;
>   sleep(100);
> }
> {code}
> After the function ends, during the sleep, the memory usage is still more than
> 100M and does not drop. When I increase the data volume by 5 times, the memory
> usage is about 500M and it still does not drop.
> I want to know whether this part of the data is cached by the memory pool, or
> whether it is a memory leak. If there is no memory leak, how do I set the
> memory pool size or release the memory?


