On Wednesday 01 July 2009 10:17:51, Emmanuelle Gouillart wrote:
> Hello,
>
> I'm using numpy.memmap to open big 3-D arrays of X-ray tomography
> data. After I have created a new array using memmap, I modify the
> contrast of every Z-slice (along the first dimension) inside a for loop,
> for a better visualization of the data. Although I call memmap.flush
> after each modification of a Z-slice, the memory used by IPython keeps
> increasing at every new iteration. At the end of the loop, the memory
> used by IPython is of the same order of magnitude as the size of the
> data file (1.8 GB!). I would have expected the maximum amount of memory
> used to correspond to only one Z-slice of the 3-D array. See the code
> snippets below for more details.
>
> Is this an expected behaviour?
I think so, yes. Since the whole file (1.8 GB) has to be processed, the OS
needs to load all of it into RAM at least once. The memory holding
already-processed data is not 'freed' by the OS when you flush() the slices
(that just forces the in-memory data to be written to disk); it is only
reclaimed when the OS needs it for something else (and even then, the pages
may be swapped out to disk, still consuming resources). I'm afraid the only
way to actually 'free' the memory is to close() the *complete* dataset.
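If you want to keep using memmap, one workaround (just a minimal sketch; the
chunk size and file name are placeholders) is to re-create the mapping for
each chunk of slices and then drop every reference to it, so the mapping is
actually closed and its memory can be reclaimed:

import numpy as np

shape = (512, 981, 981)       # adapt to your dataset
imin, imax = -2, 2
chunklen = 64                 # arbitrary; tune to your available memory
for start in range(0, shape[0], chunklen):
    # open a fresh mapping for this chunk of Z-slices only
    data = np.memmap("data-vol.bin", mode='r+', dtype='f4', shape=shape)
    sl = data[start:start+chunklen]
    sl[sl < imin] = imin
    sl[sl > imax] = imax
    data.flush()
    del sl, data              # no references left -> mapping is closed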
> How can I reduce the amount of
> memory used by IPython and still process my data?
There are a number of interfaces that can deal with binary data on disk and
let you read and write slices of it easily. With this approach, you only have
to load the appropriate slice, operate on it, and write it back. That's
similar to how numpy.memmap works, but it uses far less memory (just the
loaded slice).
For example, by using the PyTables interface [1], you could do:
f = tb.openFile(filename+".h5", "r+")
data = f.root.data
for nrow, sl in enumerate(data):
    sl[sl<imin] = imin
    sl[sl>imax] = imax
    data[nrow] = sl
f.close()
which is similar to your memmap-based approach:
data = np.memmap(filename+".bin", mode='r+', dtype='f4', shape=shape)
for sl in data:
    sl[sl<imin] = imin
    sl[sl>imax] = imax
just that it takes far less memory (47 MB vs 1.9 GB) and, besides, it has no
limitation other than your available disk space (compare this with the
virtual-memory limit you hit when using memmap). The speeds are similar too:
Using numpy.memmap
Time creating data file: 8.534
Time processing data file: 32.742
Using tables
Time creating data file: 2.88
Time processing data file: 32.615
However, you can still speed up out-of-core computations by using the
recently introduced tables.Expr class (PyTables 2.2b1, see [2]), which
combines the advanced computing capabilities of Numexpr [3] and PyTables:
f = tb.openFile(filename+".h5", "r+")
data = f.root.data
expr = tb.Expr("where(data<imin, imin, data)")
expr.setOutput(data)
expr.eval()
expr = tb.Expr("where(data>imax, imax, data)")
expr.setOutput(data)
expr.eval()
f.close()
and the timings for this approach are:
Using tables.Expr
Time creating data file: 2.393
Time processing data file: 18.25
which is around 75% faster than the plain memmap or PyTables approaches.
Further, if your data is compressible, you can probably achieve additional
speed-ups by using a fast compressor (like LZO, which is supported by
PyTables out of the box).
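For instance, creating the array with an LZO filter would look something like
this (a minimal sketch; complevel=1 is just a starting point to tune):

import tables as tb

f = tb.openFile("data-vol.h5", "w")
# ask PyTables to compress every chunk with LZO
filters = tb.Filters(complib='lzo', complevel=1)
data = f.createCArray(f.root, 'data', tb.Float32Atom(),
                      shape=(512, 981, 981), filters=filters)
f.close()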
I'm attaching the script I've used for producing the above timings. You may
find it useful for trying this out against your own data.
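By the way, if your dataset already lives in a raw .bin file like in your
snippet, a one-time conversion to HDF5 could look like this (a minimal
sketch; the file names and shape are taken from my script, so adapt them):

import numpy as np
import tables as tb

shape = (512, 981, 981)       # adapt to your dataset
raw = np.memmap("data-vol.bin", mode='r', dtype='f4', shape=shape)
f = tb.openFile("data-vol.h5", "w")
data = f.createCArray(f.root, 'data', tb.Float32Atom(), shape=shape)
for nrow in range(shape[0]):
    data[nrow] = raw[nrow]    # copy one Z-slice at a time
f.close()
del raw                       # close the mapping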
[1] http://www.pytables.org
[2] http://www.pytables.org/download/preliminary/
[3] http://code.google.com/p/numexpr/
HTH,
--
Francesc Alted
import sys
from time import time
import numpy as np
import tables as tb

shape = (512, 981, 981)
filename = "/scratch2/faltet/data-vol"

def create_data(kind):
    """Fill a fresh on-disk array, one Z-slice at a time."""
    if kind == "np":
        data = np.memmap(filename+".bin", mode='w+', dtype='f4', shape=shape)
        for nrow in range(len(data)):
            data[nrow] = nrow
    else:
        f = tb.openFile(filename+".h5", "w")
        data = f.createCArray(f.root, 'data', tb.Float32Atom(), shape=shape)
        for nrow in range(len(data)):
            data[nrow] = nrow
        f.close()

def process_data(kind):
    """Clip every Z-slice to the [imin, imax] range."""
    imin, imax = -2, 2
    if kind == "np":
        data = np.memmap(filename+".bin", mode='r+', dtype='f4', shape=shape)
        for sl in data:
            sl[sl<imin] = imin
            sl[sl>imax] = imax
        #print "data (numpy)-->", data
    elif kind == "tb":
        f = tb.openFile(filename+".h5", "r+")
        data = f.root.data
        for nrow, sl in enumerate(data):
            sl[sl<imin] = imin
            sl[sl>imax] = imax
            data[nrow] = sl   # write the modified slice back to disk
        #print "data (tables) -->", data[:]
        f.close()
    else:
        f = tb.openFile(filename+".h5", "r+")
        data = f.root.data
        # evaluate the clipping expressions out-of-core, writing in place
        expr = tb.Expr("where(data<imin, imin, data)")
        expr.setOutput(data)
        expr.eval()
        expr = tb.Expr("where(data>imax, imax, data)")
        expr.setOutput(data)
        expr.eval()
        #print "data (tables.Expr)-->", data[:]
        f.close()

if __name__ == '__main__':
    if len(sys.argv) > 1:
        kind = sys.argv[1]
    else:
        kind = "np"
    if kind == "np":
        print "Using numpy.memmap"
    elif kind == "tb":
        print "Using tables"
    else:
        print "Using tables.Expr"
    t0 = time()
    create_data(kind)
    print "Time creating data file:", round(time()-t0, 3)
    t0 = time()
    process_data(kind)
    print "Time processing data file:", round(time()-t0, 3)