Hello,
I need to process several large (~40 GB) files. np.memmap seems ideal for
this, but I have run into a problem that looks like a memory leak or memory
fragmentation. The following code illustrates the problem:

import numpy as np

x = np.memmap('mybigfile.bin', mode='r', dtype='uint8')
print x.shape   # prints (42940071360,) in my case
ndat = x.shape[0]
for k in range(1000):
    # The astype ensures that the data is actually read in from disk.
    y = x[k * ndat // 1000 : (k + 1) * ndat // 1000].astype('float32')
    del y


One would expect such a program to have a roughly constant memory
footprint, but in fact 'top' shows that the RES memory continually
increases. The memory usage is real, not just an accounting artifact,
because the OS eventually starts to swap to disk. The growth does not
seem to correspond to the total size of the file.

Has anyone seen this behavior? Is there a solution? I found this article:
http://pushingtheweb.com/2010/06/python-and-tcmalloc/ which sounds similar,
but it seems that the ~40 MB chunks I am loading would be using mmap anyway
and so could be returned to the OS.
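For comparison, here is a sketch of a per-chunk mapping approach I could try instead: rather than one memmap over the whole file, create a fresh, small memmap per chunk, so that deleting it unmaps the region and the kernel is free to reclaim those pages. This is untested against the 40 GB case; 'demo.bin' is a small stand-in file and the names are illustrative.

```python
import os
import numpy as np

filename = 'demo.bin'
np.arange(10000, dtype='uint8').tofile(filename)  # small stand-in file

nbytes = os.path.getsize(filename)
nchunks = 10
chunk = nbytes // nchunks   # assumes nchunks divides the file size evenly

total = 0.0
for k in range(nchunks):
    # Map only the current chunk; np.memmap accepts arbitrary byte offsets.
    m = np.memmap(filename, mode='r', dtype='uint8',
                  offset=k * chunk, shape=(chunk,))
    y = m.astype('float32')   # the copy forces the chunk to be read from disk
    total += y.sum()
    del m, y                  # drops the mapping for this chunk

os.remove(filename)
```

Whether this actually keeps RES bounded on RHEL 5 I cannot say without measuring, but it at least guarantees each mapping is torn down before the next one is created.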

I am using nearly the latest version of numpy from the git repository
(np.__version__ returns 2.0.0.dev-Unknown), Python 2.7.1, and RHEL 5 on
x86_64.

I appreciate any suggestions.
Thanks,
Glenn
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
