I have a PyTables 2.0.4 VLArray, called "y", with about 6500 rows of about 8500
atoms each, where every atom has shape (36,). The following line takes about
20 minutes to run:

    i = sum(len(yi) for yi in y)

Question 1: Can I somehow access the length of a VLArray row without having to
read the entire row?
Question 2: Further on I only need to work with the last 20% or so of each row.
Is there an efficient way to slice from a row without having to load it all
from disk?

    for i in range(len(y)):
        yj = y[i][-2000:]  # ideally without having to read y[i][:6500]
        ...

Thanks in advance for any tips.
Regards,
Jon Olav
Background:
If y were a Numpy array in memory, the summing would be fast, because each
array object remembers its shape. For the VLArray in the HDF5 file, I realize
now that I need to read all the data to compute the total number of atoms.
That's 6500 * 8500 * 36 * 8 = 16 GB (meaning about 13 MB/s for 20 minutes).
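
As a sanity check of the estimate above, here is a minimal sketch that uses
only the approximate figures quoted in this message. The variable names are
mine, and nothing here touches the HDF5 file:

```python
# Approximate figures from the message; all are rounded estimates.
n_rows = 6500          # rows in the VLArray
atoms_per_row = 8500   # typical row length
atom_items = 36        # each atom has shape (36,)
item_bytes = 8         # Float64
wall_seconds = 20 * 60 # "about 20 minutes"

total_bytes = n_rows * atoms_per_row * atom_items * item_bytes
mb_per_second = total_bytes / 1e6 / wall_seconds

print("Data read: %.1f GB" % (total_bytes / 1e9))
print("Implied throughput: %.1f MB/s" % mb_per_second)
```

The result is consistent with the size of vlarraytest.h5 shown in the session
further down.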
From the timing below (and watching "top" for ages), I see that the generator
(len(yi) for yi in y) spent almost all its time _waiting_ for disk access
(status 'D' = uninterruptible sleep, which the support staff tell me means
waiting for disk).
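
To put a rough number on "almost all its time", here is a small sketch using
the CPU and wall times reported for In [17] in the session below. It assumes
that time not spent on the CPU was spent waiting for I/O, which matches the
'D' state I saw in "top" but is not proven by these figures alone:

```python
# Times reported by IPython's "time" for the summing line, in seconds.
user_s = 39.93
sys_s = 16.24
wall_s = 1192.63

cpu_s = user_s + sys_s
# Assumption: whatever is not CPU time was spent blocked, mostly on disk.
off_cpu_fraction = 1.0 - cpu_s / wall_s

print("CPU time: %.2f s of %.2f s wall" % (cpu_s, wall_s))
print("Fraction of wall time off the CPU: %.1f%%" % (100 * off_cpu_fraction))
```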

    In [17]: time i = sum(len(yi) for yi in y)
    CPU times: user 39.93 s, sys: 16.24 s, total: 56.16 s
    Wall time: 1192.63

    In [18]: y
    Out[18]:
    /ap/ph/y (VLArray(6561L,)) 'State vector'
      atom = Float64Atom(shape=(36L,), dflt=0.0)
      byteorder = 'little'
      nrows = 6561
      flavor = 'numpy'

    In [20]: len(y[0])
    Out[20]: 8977

    In [23]: ls -l vlarraytest.h5
    -rw-r--r-- 1 jonvi users 17377780785 ...

_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users