Hi Alex,
I think you're misinterpreting the results. Yes, the RSS memory (as
reported by memory_profiler) doesn't seem to decrease. No, it doesn't
mean that Arrow doesn't release memory. It's actually common for memory
allocators (such as jemalloc, or the system allocator) to keep
deallocated pages around, because asking the kernel to recycle them is
expensive.
Unless your system is running low on memory, you shouldn't care about
this. Trying to return memory to the kernel can actually make
performance worse if you ask Arrow to allocate memory soon after.
That said, you can try to call MemoryPool.release_unused() if these
numbers are important to you:
https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
Regards
Antoine.
On 15/06/2023 at 17:39, Jerald Alex wrote:
Hi Experts,
I came across the *ARROW_DEFAULT_MEMORY_POOL* environment variable, which
selects the memory pool, and tested it.
I observed an improvement on macOS with the *system* memory pool, but no
change on Linux. I have captured more details in the GitHub issue
https://github.com/apache/arrow/issues/36100... If anyone can point out or
suggest a way to overcome this problem, that would be helpful. I appreciate
your help!
Regards,
Alex
On Wed, Jun 14, 2023 at 9:35 PM Jerald Alex <vminf...@gmail.com> wrote:
Hi Experts,
PyArrow's *Table.from_pylist* does not release memory until the program
terminates. I created a sample script to highlight the issue. I have also
tried calling `pa.jemalloc_set_decay_ms(0)`, but it didn't help much.
Could you please check this and let me know whether there are known issues
or any workaround to resolve this?
pyarrow.__version__
'12.0.0'
OS Details:
OS: macOS 13.4 (22F66)
Kernel Version: Darwin 22.5.0
Sample code to reproduce (requires memory_profiler):
# file_name: test_exec.py
import pyarrow as pa
import time
import random
import string
from memory_profiler import profile


def get_sample_data():
    record1 = {}
    for col_id in range(15):
        record1[f"column_{col_id}"] = string.ascii_letters[10:random.randint(17, 49)]
    return [record1]


def construct_data(data):
    count = 1
    while count < 10:
        pa.Table.from_pylist(data * 100000)
        count += 1
    return True


@profile
def main():
    data = get_sample_data()
    construct_data(data)
    print("construct data completed!")


if __name__ == "__main__":
    main()
    time.sleep(600)
memory_profiler output:

Filename: test_exec.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41     65.6 MiB     65.6 MiB           1   @profile
    42                                         def main():
    43     65.6 MiB      0.0 MiB           1       data = get_sample_data()
    44    203.8 MiB    138.2 MiB           1       construct_data(data)
    45    203.8 MiB      0.0 MiB           1       print("construct data completed!")
Regards,
Alex