jonded94 commented on issue #44599:
URL: https://github.com/apache/arrow/issues/44599#issuecomment-2479532080

   As all of this is unfortunately an internal project, I'm unable to share any 
specifics, especially files. I'm very sorry!
   Maybe I can find some time to create synthetic files which show a similar 
behaviour.
   
   > it would be nice if you could compare the memory consumption with PyArrow 
and with parquet-rs. 
   
   I can do that with a similar workload that reads actual rows rather than *metadata*. That is unfortunately not strictly connected to the original issue reported here, I'm sorry! But the results could nevertheless be interesting, so I'll share them. Sorry that this issue is so meandering.
   
   This workload simply consumed `RecordBatch` instances with a batch size of 2250 via `pyarrow.dataset.Fragment.to_batches` (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Fragment.html#pyarrow.dataset.Fragment.to_batches). Nothing was done with them; they were immediately discarded, very similar to the original script I shared.
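   For reference, a minimal sketch of such a read-and-discard loop might look like the following. This is not the internal script (which I can't share); the file path, column, and row count are purely illustrative, and the file is generated on the fly so the snippet is self-contained:
   
   ```python
   # Hypothetical reproduction sketch: write a small Parquet file, then
   # iterate its batches via Fragment.to_batches and discard them.
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   tmp = tempfile.mkdtemp()
   path = os.path.join(tmp, "data.parquet")
   pq.write_table(pa.table({"x": list(range(10_000))}), path)
   
   dataset = ds.dataset(path, format="parquet")
   n = 0
   for fragment in dataset.get_fragments():
       for batch in fragment.to_batches(batch_size=2250):
           n += batch.num_rows  # batch goes out of scope immediately
   print(n)  # → 10000
   ```
   
   In the real workload this loop runs over many large fragments, which is where the steadily growing memory shows up.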
   
   
![image](https://github.com/user-attachments/assets/b33accbf-d122-4428-aaeb-72a08dbc0368)
   
   As one can see, memory usage just increases steadily. Running this script in parallel very easily leads to OOM errors (which we could also reproduce).
   
   Then we tried creating an entirely new memory pool for every data iteration 
as a workaround:
   
![image](https://github.com/user-attachments/assets/254aa104-8b00-40d8-a439-1014b2ffccc6)
   This helped, but the memory load is still quite high.
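   For illustration, one way to approximate that workaround from Python is to hand each iteration its own pool through the `memory_pool` argument of `Fragment.to_batches`. Whether this matches our exact internal setup is an assumption; here a `proxy_memory_pool` wrapping the default pool stands in for "an entirely new pool", and the data file is synthetic:
   
   ```python
   # Hedged sketch of a per-iteration memory-pool workaround. A proxy
   # pool still allocates from the underlying default pool, but gives a
   # separate handle (and per-iteration bytes_allocated() accounting).
   import os
   import tempfile
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   tmp = tempfile.mkdtemp()
   path = os.path.join(tmp, "data.parquet")
   pq.write_table(pa.table({"x": list(range(10_000))}), path)
   
   dataset = ds.dataset(path, format="parquet")
   total = 0
   for fragment in dataset.get_fragments():
       # Fresh pool handle for this iteration only.
       pool = pa.proxy_memory_pool(pa.default_memory_pool())
       for batch in fragment.to_batches(batch_size=2250, memory_pool=pool):
           total += batch.num_rows
   print(total)  # → 10000
   ```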
   
   When using the Rust reader for the same workload, memory usage behaved much more predictably and was quite a bit lower:
   
![image](https://github.com/user-attachments/assets/bbce1260-ade7-45dd-af3c-016b54b5e028)
   
   
   

