Hi Alex,

I think you're misinterpreting the results. Yes, the RSS memory (as reported by memory_profiler) doesn't seem to decrease. No, it doesn't mean that Arrow doesn't release memory. It's actually common for memory allocators (such as jemalloc, or the system allocator) to keep deallocated pages around, because asking the kernel to recycle them is expensive.
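
One way to see this for yourself is to look at the memory pool's own statistics rather than RSS. A rough sketch (untested, adapt to your script):

import pyarrow as pa

pool = pa.default_memory_pool()

# Build a few tables, then drop them
tables = [pa.table({"x": list(range(1_000_000))}) for _ in range(5)]
print("while tables are alive:", pool.bytes_allocated(), "bytes")

del tables
# The pool should now report (close to) zero bytes allocated, even though
# RSS as seen by memory_profiler may stay flat, because the allocator keeps
# the freed pages around for reuse.
print("after deleting the tables:", pool.bytes_allocated(), "bytes")
print("peak allocation:", pool.max_memory(), "bytes")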

Unless your system is running low on memory, you shouldn't care about this. Trying to return memory to the kernel can actually make performance worse if you ask Arrow to allocate memory soon after.

That said, you can try to call MemoryPool.release_unused() if these numbers are important to you:
https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
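
For example (minimal sketch, untested):

import pyarrow as pa

pool = pa.default_memory_pool()
# ... create and discard your tables here ...
pool.release_unused()  # try to return unused pages to the OS

Note that this is only a best-effort request; depending on the allocator backend, RSS may still not drop all the way.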

Regards

Antoine.



On 15/06/2023 at 17:39, Jerald Alex wrote:
Hi Experts,

I came across the memory pool configuration via the environment variable
*ARROW_DEFAULT_MEMORY_POOL* and tried it out.
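
For reference, here is a rough sketch (untested) of how to check which allocator backend is actually in use:

# Select the allocator before starting Python, e.g.:
#   ARROW_DEFAULT_MEMORY_POOL=system python test_exec.py
import pyarrow as pa

# Confirm which backend pyarrow picked up
print(pa.default_memory_pool().backend_name)  # 'system', 'jemalloc' or 'mimalloc'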

I observed an improvement on macOS with the *system* memory pool, but no
change on Linux. I have captured more details in GH issue
https://github.com/apache/arrow/issues/36100... If anyone can highlight the
cause or suggest a way to overcome this problem, it would be very helpful.
I appreciate your help!

Regards,
Alex

On Wed, Jun 14, 2023 at 9:35 PM Jerald Alex <vminf...@gmail.com> wrote:

Hi Experts,

Pyarrow *Table.from_pylist* does not release memory until the program
terminates. I created a sample script to highlight the issue. I have also
tried setting `pa.jemalloc_set_decay_ms(0)`, but it didn't help much.
Could you please check this and let me know if there are potential issues /
any workaround to resolve this?

pyarrow.__version__
'12.0.0'

OS Details:
OS: macOS 13.4 (22F66)
Kernel Version: Darwin 22.5.0



Sample code to reproduce (it needs memory_profiler):

#file_name: test_exec.py
import pyarrow as pa
import time
import random
import string

from memory_profiler import profile

def get_sample_data():
    # Build one record with 15 string columns of random length
    record1 = {}
    for col_id in range(15):
        record1[f"column_{col_id}"] = string.ascii_letters[10:random.randint(17, 49)]
    return [record1]

def construct_data(data):
    count = 1
    while count < 10:
        # The table is discarded right away, so its memory should become reusable
        pa.Table.from_pylist(data * 100000)
        count += 1
    return True

@profile
def main():
    data = get_sample_data()
    construct_data(data)
    print("construct data completed!")

if __name__ == "__main__":
    main()
    time.sleep(600)


memory_profiler output:

Filename: test_exec.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     41     65.6 MiB     65.6 MiB           1   @profile
     42                                         def main():
     43     65.6 MiB      0.0 MiB           1       data = get_sample_data()
     44    203.8 MiB    138.2 MiB           1       construct_data(data)
     45    203.8 MiB      0.0 MiB           1       print("construct data completed!")

Regards,
Alex

