Re: [I] [C++] Metadata related memory leak when reading parquet dataset [arrow]

via GitHub Wed, 29 Jan 2025 11:46:52 -0800


timothydijamco commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2622689751


   > `ReleaseUnused` is best effort, so you can't really deduce this 
unfortunately. The new https://github.com/apache/arrow/pull/45359 might allow 
you to get a better idea, though the allocator stats are not always easy to 
understand.
   
   I see, that's fair.
   
   ---
   
   > Adding `physical_schema_.reset()` to the `ClearCachedMetadata()` method 
(from https://github.com/apache/arrow/pull/45330) seems to reduce memory usage 
a bit further
   
   I did some memory profiling on a version of Arrow with 
`physical_schema_.reset()` added, and memory usage now actually looks 
bounded. 
   
   Here are the memory usage graphs of a C++ program that scans the "250 files, 
10k columns, 200-character-long column names" dataset twice:
   | Clearing `metadata_`, `manifest_`, `original_metadata_` | Clearing `metadata_`, `manifest_`, `original_metadata_`, **`physical_schema_`** |
   |------|---------|
   | <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/ec750ab1-fab1-4712-97f3-13cabb0d06f5" /> | <img width="1000" alt="Image" src="https://github.com/user-attachments/assets/48280ac2-7dcf-49a1-898c-c7712d7378b6" /> |
   
   
   And for good measure, here's the same comparison on a dataset with twice as 
many files (500 instead of 250), which shows the memory accumulation more clearly:
   | Clearing `metadata_`, `manifest_`, `original_metadata_` | Clearing `metadata_`, `manifest_`, `original_metadata_`, **`physical_schema_`** |
   |------|---------|
   | <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/64cf67a6-3a32-4d48-a321-104ac44017e8" /> | <img width="1002" alt="Image" src="https://github.com/user-attachments/assets/43484c53-57e8-4816-90ec-8a3d93005dc4" /> |
   
   
   Overall, clearing `metadata_`, `manifest_`, `original_metadata_`, and 
`physical_schema_` together seems to prevent metadata-related objects from 
accumulating over a scan. I'm going to test on some real datasets as well to 
see how they are affected.
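   
   To illustrate why resetting the extra member matters, here is a minimal, self-contained sketch. It is not Arrow's actual code: `FragmentSketch`, `SchemaSketch`, `MetadataSketch`, `EnsureMetadataCached`, and `CountLiveSchemasAfterScan` are hypothetical stand-ins, and only the `ClearCachedMetadata()` and `physical_schema_` names come from the discussion above. The assumption is that each fragment holds its cached objects through `std::shared_ptr` members, so any member left un-reset keeps its object alive for as long as the fragment itself lives.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for the cached objects; these are NOT Arrow's types.
struct SchemaSketch { std::vector<std::string> field_names; };
struct MetadataSketch { std::vector<std::string> column_paths; };

// Hypothetical fragment that caches per-file metadata in shared_ptr members.
class FragmentSketch {
 public:
  // Populate the caches once, as a scan would when it first opens the file.
  void EnsureMetadataCached(int num_columns, int name_length) {
    assert(num_columns >= 0 && name_length >= 0);
    if (metadata_) return;
    std::vector<std::string> names(num_columns, std::string(name_length, 'x'));
    metadata_ = std::make_shared<MetadataSketch>(MetadataSketch{names});
    physical_schema_ = std::make_shared<SchemaSketch>(SchemaSketch{names});
  }

  // Drop every cached object; a member that is not reset here stays alive.
  void ClearCachedMetadata() {
    metadata_.reset();
    physical_schema_.reset();
  }

  // Non-owning handle used only to observe whether the schema was freed.
  std::weak_ptr<SchemaSketch> schema_observer() const { return physical_schema_; }

 private:
  std::shared_ptr<MetadataSketch> metadata_;
  std::shared_ptr<SchemaSketch> physical_schema_;
};

// "Scan" num_files fragments and count how many cached schemas are still
// alive while the fragments themselves are still held by the dataset.
std::size_t CountLiveSchemasAfterScan(int num_files, bool clear_after_use) {
  std::vector<FragmentSketch> fragments(num_files);
  std::vector<std::weak_ptr<SchemaSketch>> observers;
  for (auto& fragment : fragments) {
    fragment.EnsureMetadataCached(/*num_columns=*/100, /*name_length=*/200);
    observers.push_back(fragment.schema_observer());
    if (clear_after_use) fragment.ClearCachedMetadata();
  }
  std::size_t live = 0;
  for (const auto& observer : observers) {
    if (!observer.expired()) ++live;
  }
  return live;
}
```

In this sketch, `CountLiveSchemasAfterScan(250, false)` leaves one schema alive per fragment, so retained memory grows with the number of files, while passing `true` leaves none alive. That mirrors the bounded-versus-accumulating behavior in the graphs, under the stated assumption about ownership.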


