On Monday 13 December 2010 15:08:03 Francesc Alted wrote:
> On Monday 13 December 2010 14:56:26 Dominik Szczerba wrote:
> > > But to know whether accessing columns is efficient in your
> > > case, I'd need more info on your datasets. Are they contiguous
> > > or chunked? If chunked, which chunkshape have you chosen?
> >
> > Both. Files saved from MATLAB are uncompressed/contiguous; the ones
> > saved from my program are usually compressed/chunked, and the chunk
> > size is around 1024^2/sizeof(type).
>
> Well, for PyTables (or any C application) and contiguous datasets,
> accessing data by columns is inefficient: the privileged direction
> for performance is rows.
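The reason is memory layout. A quick NumPy-only sketch (array shape and sizes chosen just for illustration): in a C-ordered array a row is one contiguous run of bytes, while a column is strided by the full row length, so reading a column touches memory (or disk) in widely spaced jumps.

```python
import numpy as np

# C-ordered (row-major) array: row elements are adjacent in memory,
# column elements are separated by a full row of bytes.
a = np.empty((4, 1000))        # default dtype float64, 8 bytes/item

row = a[0]     # contiguous view: stride == itemsize
col = a[:, 0]  # strided view: stride == 1000 * 8 == 8000 bytes

print(row.flags['C_CONTIGUOUS'])  # True
print(col.flags['C_CONTIGUOUS'])  # False
print(a.strides)                  # (8000, 8)
```

The same geometry applies to an HDF5 contiguous dataset on disk, which is why column-wise reads pay such a heavy price.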
I was curious to see the difference in performance. Here are some
timings:
>>> nptetra = np.empty((4, 4622544))
>>> f = tb.openFile("/tmp/t.h5", "w")
>>> tetra = f.createArray(f.root, "tetra", nptetra)
>>> %time [ tetra[:,i] for i in range(4622544) ]
CPU times: user 201.61 s, sys: 162.59 s, total: 364.20 s
Wall time: 367.91 s
Using the transposed version (i.e. accessing by rows):
>>> tetra2 = f.createArray(f.root, "tetra2", nptetra.transpose())
>>> %time [ tetra2[i] for i in range(4622544) ]
CPU times: user 163.78 s, sys: 0.48 s, total: 164.25 s
Wall time: 165.44 s # the time is more than 2x faster
But using the iterator is the fastest mode (the I/O is buffered):
>>> %time [ row for row in tetra2 ]
CPU times: user 26.21 s, sys: 0.38 s, total: 26.59 s
Wall time: 26.81 s
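Roughly, the iterator wins because it fetches many rows per I/O call and then hands them out one at a time from an in-memory buffer. A minimal stand-alone sketch of that buffering pattern (the `get_slice` callable and `bufsize` value are illustrative, not the actual PyTables internals):

```python
import numpy as np

def buffered_rows(get_slice, nrows, bufsize=10000):
    """Yield rows one by one, but fetch them from storage in large
    slices -- one I/O call per buffer instead of one per row."""
    for start in range(0, nrows, bufsize):
        stop = min(start + bufsize, nrows)
        buf = get_slice(start, stop)   # single bulk read
        for row in buf:
            yield row

# Stand-in for a disk-backed dataset (hypothetical helper):
data = np.arange(20).reshape(10, 2)
rows = list(buffered_rows(lambda a, b: data[a:b], len(data), bufsize=4))
```

With per-row reads replaced by three bulk reads of four rows each, the per-call overhead largely disappears, which matches the ~6x gap seen above.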
I'd say that for chunked datasets you can expect something similar.
--
Francesc Alted
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users