On Wednesday 01 July 2009 15:04:08, Francesc Alted wrote:
> However, you can still speed-up out-of-core computations by using the
> recently introduced tables.Expr class (PyTables 2.2b1, see [2]), which uses
> a combination of the Numexpr [3] and PyTables advanced computing
> capabilities:
>
> f = tb.openFile(filename+".h5", "r+")
> data = f.root.data
> expr = tb.Expr("where(data<imin, imin, data)")
> expr.setOutput(data)
> expr.eval()
> expr = tb.Expr("where(data>imax, imax, data)")
> expr.setOutput(data)
> expr.eval()
> f.close()
>
> and the timings for this approach are:
>
> Using tables.Expr
> Time creating data file: 2.393
> Time processing data file: 18.25
>
> which is around 75% faster than a pure memmap/PyTables approach.
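As a quick plain-NumPy aside (a small in-memory sketch of my own, not part of the out-of-core benchmark above), the two where() passes quoted above amount to clipping the array between imin and imax:

```python
import numpy as np

# Two-pass clipping, mirroring the two tables.Expr evaluations above.
data = np.array([-5.0, -1.0, 0.0, 3.0], dtype='f4')
imin, imax = -2, 2

data = np.where(data < imin, imin, data)  # first pass: clip from below
data = np.where(data > imax, imax, data)  # second pass: clip from above
# data is now [-2., -1., 0., 2.], the same as np.clip would give
```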
Oops, I suddenly realized that the above can be further accelerated by
combining both expressions into a single nested one. Something like:
f = tb.openFile(filename+".h5", "r+")
data = f.root.data
# Complex expression that spans several lines follows
expr = tb.Expr("""
    where(data < imin, imin,
          where(data > imax, imax, data))
""")
expr.setOutput(data)
expr.eval()
f.close()
With this change, the computation time is now:
Using tables.Expr
Time creating data file: 2.18
Time processing data file: 10.992
which represents another 65% improvement over the version using two
expressions (and makes it 3x faster than the numpy.memmap version).
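For anyone without PyTables at hand, the combined clipping done by the nested expression can be sanity-checked in plain NumPy (again an in-memory sketch of mine, not the out-of-core benchmark):

```python
import numpy as np

# Single-pass nested where(): clip below imin and above imax at once.
data = np.array([-5.0, -1.0, 0.0, 3.0], dtype='f4')
imin, imax = -2, 2

clipped = np.where(data < imin, imin,
                   np.where(data > imax, imax, data))
# clipped is [-2., -1., 0., 2.], identical to the two-pass result
```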
> Further, if your data is compressible, you can probably achieve additional
> speed-ups by using a fast compressor (like LZO, which is supported by
> PyTables right out-of-the-box).
As I was curious, I've tried activating the LZO compressor. Here are the
results:
Using tables.Expr
Time creating data file: 3.123
Time processing data file: 12.533
Hmm, contrary to my expectations, this hasn't accelerated the computation.
My guess is that, the data being very simple and synthetic, the compression
ratio is very high (around 200x), which forces the compressor/decompressor to
do a lot of work here. However, with real-life data the speed could
effectively improve. OTOH, using a faster compressor could be very
advantageous here too :)
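The huge compression ratio is easy to reproduce with the stdlib zlib module (a rough illustration of my own, not the PyTables benchmark): a synthetic row holding a single repeated value compresses extremely well, so a chunked file format ends up running the (de)compressor over every chunk it touches:

```python
import zlib
import numpy as np

# One constant-valued row, like the synthetic rows in the benchmark script.
row = np.full(981 * 981, 7.0, dtype='f4')
raw = row.tobytes()

# Fast compression level, analogous to complevel=1 in the PyTables filters.
packed = zlib.compress(raw, 1)
ratio = len(raw) / float(len(packed))  # very high for such repetitive data
```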
Cheers,
--
Francesc Alted
import sys
from time import time
import numpy as np
import tables as tb
shape=(512, 981, 981)
filename = "/scratch2/faltet/data-vol"
# Choose the filter that you prefer
#filters = None # No compression
#filters = tb.Filters(complib="gzip", complevel=1, shuffle=True)
#filters = tb.Filters(complib="lzo", complevel=1, shuffle=True)
filters = tb.Filters(complib="lzo", complevel=1, shuffle=False)
def create_data(kind):
    if kind == "np":
        data = np.memmap(filename+".bin", mode='w+', dtype='f4', shape=shape)
        for nrow in range(len(data)):
            data[nrow] = nrow
    else:
        f = tb.openFile(filename+".h5", "w")
        data = f.createCArray(f.root, 'data', tb.Float32Atom(), shape=shape,
                              filters=filters)
        for nrow in range(len(data)):
            data[nrow] = nrow
        f.close()
def process_data(kind):
    imin, imax = -2, 2
    if kind == "np":
        data = np.memmap(filename+".bin", mode='r+', dtype='f4', shape=shape)
        for sl in data:
            sl[sl<imin] = imin
            sl[sl>imax] = imax
        #print "data (numpy)-->", data
    elif kind == "tb":
        f = tb.openFile(filename+".h5", "r+")
        data = f.root.data
        for nrow, sl in enumerate(data):
            sl[sl<imin] = imin
            sl[sl>imax] = imax
            data[nrow] = sl
        #print "data (tables) -->", data[:]
        f.close()
    else:
        f = tb.openFile(filename+".h5", "r+")
        data = f.root.data
        # Complex expression that spans several lines follows
        expr = tb.Expr("""
            where(data < imin, imin,
                  where(data > imax, imax, data))
        """)
        expr.setOutput(data)
        expr.eval()
        #print "data (tables.Expr)-->", data[:]
        f.close()
if __name__ == '__main__':
    if len(sys.argv) > 1:
        kind = sys.argv[1]
    else:
        kind = "np"
    if kind == "np":
        print "Using numpy.memmap"
    elif kind == "tb":
        print "Using tables"
    else:
        print "Using tables.Expr"
    t0 = time()
    create_data(kind)
    print "Time creating data file:", round(time()-t0, 3)
    t0 = time()
    process_data(kind)
    print "Time processing data file:", round(time()-t0, 3)
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion