Note that you can ask pyarrow how much memory it thinks it is using with the pyarrow.total_allocated_bytes[1] function. This can be very useful for tracking memory leaks.
I see that memory-profiler now has support for different backends. Sadly, it doesn't look like you can register a custom backend. Might be a fun project if someone wanted to add a pyarrow backend for it :)

[1] https://arrow.apache.org/docs/python/generated/pyarrow.total_allocated_bytes.html

On Thu, Jun 15, 2023 at 9:16 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Hi Alex,
>
> I think you're misinterpreting the results. Yes, the RSS memory (as
> reported by memory_profiler) doesn't seem to decrease. No, it doesn't
> mean that Arrow doesn't release memory. It's actually common for memory
> allocators (such as jemalloc, or the system allocator) to keep
> deallocated pages around, because asking the kernel to recycle them is
> expensive.
>
> Unless your system is running low on memory, you shouldn't care about
> this. Trying to return memory to the kernel can actually make
> performance worse if you ask Arrow to allocate memory soon after.
>
> That said, you can try to call MemoryPool.release_unused() if these
> numbers are important to you:
>
> https://arrow.apache.org/docs/python/generated/pyarrow.MemoryPool.html#pyarrow.MemoryPool.release_unused
>
> Regards
>
> Antoine.
>
>
> Le 15/06/2023 à 17:39, Jerald Alex a écrit :
> > Hi Experts,
> >
> > I have come across the memory pool configurations using an environment
> > variable *ARROW_DEFAULT_MEMORY_POOL*, and I tried to make use of them
> > and test it.
> >
> > I could observe improvements on macOS with the *system* memory pool but
> > no change on Linux. I have captured more details in GH issue
> > https://github.com/apache/arrow/issues/36100... If anyone can highlight
> > or suggest a way to overcome this problem, that would be helpful.
> > Appreciate your help!
> >
> > Regards,
> > Alex
> >
> > On Wed, Jun 14, 2023 at 9:35 PM Jerald Alex <vminf...@gmail.com> wrote:
> >
> >> Hi Experts,
> >>
> >> Pyarrow *Table.from_pylist* does not release memory until the program
> >> terminates.
> >> I created a sample script to highlight the issue. I have also
> >> tried setting up `pa.jemalloc_set_decay_ms(0)` but it didn't help much.
> >> Could you please check this and let me know if there are potential
> >> issues / any workaround to resolve this?
> >>
> >> >>> pyarrow.__version__
> >> '12.0.0'
> >>
> >> OS details:
> >> OS: macOS 13.4 (22F66)
> >> Kernel version: Darwin 22.5.0
> >>
> >> Sample code to reproduce (it needs memory_profiler):
> >>
> >> # file_name: test_exec.py
> >> import pyarrow as pa
> >> import time
> >> import random
> >> import string
> >>
> >> from memory_profiler import profile
> >>
> >> def get_sample_data():
> >>     record1 = {}
> >>     for col_id in range(15):
> >>         record1[f"column_{col_id}"] = string.ascii_letters[10 : random.randint(17, 49)]
> >>     return [record1]
> >>
> >> def construct_data(data):
> >>     count = 1
> >>     while count < 10:
> >>         pa.Table.from_pylist(data * 100000)
> >>         count += 1
> >>     return True
> >>
> >> @profile
> >> def main():
> >>     data = get_sample_data()
> >>     construct_data(data)
> >>     print("construct data completed!")
> >>
> >> if __name__ == "__main__":
> >>     main()
> >>     time.sleep(600)
> >>
> >> memory_profiler output:
> >>
> >> Line #    Mem usage    Increment  Occurrences   Line Contents
> >> =============================================================
> >>     41     65.6 MiB     65.6 MiB           1   @profile
> >>     42                                         def main():
> >>     43     65.6 MiB      0.0 MiB           1       data = get_sample_data()
> >>     44    203.8 MiB    138.2 MiB           1       construct_data(data)
> >>     45    203.8 MiB      0.0 MiB           1       print("construct data completed!")
> >>
> >> Regards,
> >> Alex
> >>
> >
>