Hi,

I just asked this question on the IRC channel but didn't get a definitive answer, though some people replied with suggestions that expanded the question a bit.

I have a program that has to read some pickle files, perform some operations on them, and then return. The pickle objects I am reading all share the same structure: a single list with two elements, where the first one is a long list and the second one is a numpy object.
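For reference, the files were generated with a structure along these lines (a made-up sketch for illustration; the real list and array are much larger, and the path here is a temporary one, not the actual data location):

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical example of the structure described above: a single
# list whose first element is a long list and whose second element
# is a numpy object (sizes here are made up for illustration).
data = [list(range(100_000)), np.zeros((100, 100))]

# Write it out the same way the real files were written.
path = os.path.join(tempfile.mkdtemp(), 'example.pk')
with open(path, 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
```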

I found out that, after calling that function, the memory taken by the Python process (monitored using htop -- the whole thing runs on Python 3.6 on Ubuntu 16.04, a fairly standard conda installation with a few packages added via `conda install`) increases in proportion to the size of the pickle object being read. My intuition is that this memory should be freed once the function returns.

Does pickle keep a cache of objects in memory after they have been returned? I thought that might be the answer, but then someone suggested measuring the time it takes to load the objects. This is a script I wrote to test that; nothing(filepath) just loads the pickle file, discards the output, and returns how long the load operation took.

---
import glob
import pickle
import timeit
import os
import psutil

def nothing(filepath):
    """Load a pickle file, discard the result, and return the elapsed time."""
    start = timeit.default_timer()
    with open(filepath, 'rb') as f:
        _ = pickle.load(f)
    return timeit.default_timer() - start

if __name__ == "__main__":

    filelist = glob.glob('/tmp/test/*.pk')

    for i, filepath in enumerate(filelist):
        print("Size of file {}: {}".format(i, os.path.getsize(filepath)))
        print("First call:", nothing(filepath))
        print("Second call:", nothing(filepath))
        print("Memory usage:", psutil.Process(os.getpid()).memory_info().rss)
        print()
---

This is the output from the second time the script was run, so that a cold I/O cache doesn't skew the timings:

---
Size of file 0: 11280531
First call: 0.1466723980847746
Second call: 0.10044755204580724
Memory usage: 49418240

Size of file 1: 8955825
First call: 0.07904054620303214
Second call: 0.07996074995025992
Memory usage: 49831936

Size of file 2: 43727266
First call: 0.37741047400049865
Second call: 0.38176894187927246
Memory usage: 49758208

Size of file 3: 31122090
First call: 0.271301960805431
Second call: 0.27462846506386995
Memory usage: 49991680

Size of file 4: 634456686
First call: 5.526095286011696
Second call: 5.558765463065356
Memory usage: 539324416

Size of file 5: 3349952658
First call: 29.50982437795028
Second call: 29.461691531119868
Memory usage: 3443597312

Size of file 6: 9384929
First call: 0.0826977719552815
Second call: 0.08362263604067266
Memory usage: 3443597312

Size of file 7: 422137
First call: 0.0057482069823890924
Second call: 0.005949910031631589
Memory usage: 3443597312

Size of file 8: 409458799
First call: 3.562588643981144
Second call: 3.6001368327997625
Memory usage: 3441451008

Size of file 9: 44843816
First call: 0.39132978999987245
Second call: 0.398518088972196
Memory usage: 3441451008
---

Notice that memory usage jumps noticeably on files 4 and 5, the biggest ones, and doesn't come back down afterwards as I would expect. But the loading time is essentially the same on the first and second calls for every file, so I think I can rule out any pickle caching mechanism.
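In case it helps, here is a minimal sketch of a follow-up experiment I can run to check whether memory freed inside the process is actually returned to the OS (everything here is standard library plus psutil; the `malloc_trim` call is glibc/Linux-specific, so it's wrapped in a try/except):

```python
import ctypes
import gc
import os

import psutil

def rss():
    """Resident set size of this process, in bytes."""
    return psutil.Process(os.getpid()).memory_info().rss

# Allocate and then drop a large number of small objects, loosely
# mimicking the long list inside the pickles.
data = [str(i) for i in range(1_000_000)]
after_alloc = rss()
del data
gc.collect()
after_free = rss()

# On glibc, malloc_trim(0) asks the allocator to hand any free
# memory back to the OS; skip it on other platforms.
try:
    ctypes.CDLL('libc.so.6').malloc_trim(0)
except OSError:
    pass
after_trim = rss()

print(after_alloc, after_free, after_trim)
```

If the RSS stays high even after the forced GC pass and the trim, that would suggest the memory is being held below the Python level.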

So I guess now my question is: can anyone give me any pointers as to why this is happening? Any help is appreciated.

Thanks,

--
José María (Chema) Mateos || https://rinzewind.org/