The problem in brief: Why does it take 20-40 seconds to extract a table column
of 200000 integers? The code snippet in question is:
with pt.openFile(filename) as f:
    vlarrayrow = f.root.gp.cols.vlarrayrow[:]
== Background ==
I have an HDF5 file (let's call it file "P") containing parameter scenarios for
simulations. It is about 1 GB in size, with about 200000 records in each of
four tables. Another HDF5 file (file "O") holds 40 GB of simulation output,
but not
in the same order as in file P. The output row number for each parameter
scenario is stored in a column of a table in file P. However, extracting just
that column is disappointingly slow. I have profiled the following example
script; the output is below [1].
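For context, once extracted, the column is only used for fancy indexing into
the output, along the lines of this toy sketch (hypothetical small arrays
standing in for the real data in files O and P):

```python
import numpy as np

# Hypothetical stand-ins: file O's output rows, and the row-number
# column (vlarrayrow) extracted from file P.
output = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])  # rows of file O
vlarrayrow = np.array([2, 0, 3, 1])  # file-P order -> file-O row number

# Reorder file O's rows to match the parameter scenarios in file P.
reordered = output[vlarrayrow]
print(reordered[0])  # row 2 of the output, i.e. [30 31]
```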
== prof_pt.py ==
"""Run with: python -m cProfile prof_pt.py"""
from __future__ import with_statement
import tables as pt
filename = "myfile.py.h5"
with pt.openFile(filename) as f:
vlarrayrow = f.root.gp.cols.vlarrayrow[:]
A lot of the time seems to be spent in "method '_fillCol' of
'tableExtension.Row' objects", file.py and __init__.py (I don't know what the
latter two are).
== Reproducible example ==
I've made an example script to replicate the problem without reference to the
details of my actual use case. I hypothesized that cols.x[:] was slow because
it had to skip (or perhaps even read, *shudder*) the remainder of every record
in the table. However, it seems there is a critical file size at which the
problem manifests.
The example makes a table with "nrow" records of the following description,
where "othersize" sets the amount of other stuff in the table.
class result(pt.IsDescription):
    otherdata = pt.IntCol(shape=othersize)
    i = pt.IntCol()
Then I time the statement:
i = f.root.t.cols.i[:]
The code below [2] produces the following output:
-bash-3.2$ python profcols.py
...
INFO:root:4.74983906746 seconds, (nrow,othersize=50000,2000)
INFO:root:0.610723972321 seconds, (nrow,othersize=52000,2000)
INFO:root:0.568428993225 seconds, (nrow,othersize=54000,2000)
INFO:root:4.76227211952 seconds, (nrow,othersize=56000,2000)
INFO:root:5.63922095299 seconds, (nrow,othersize=58000,2000)
INFO:root:20.6964008808 seconds, (nrow,othersize=50000,4000)
INFO:root:19.6613388062 seconds, (nrow,othersize=52000,4000)
INFO:root:18.8091700077 seconds, (nrow,othersize=54000,4000)
... (I stopped it here; the file was getting to 800 MB)
For comparable table sizes, getting "i" sometimes takes about half a second
and sometimes ten times as long. Doubling the record size brings the access
time up to forty times the fastest case.
(If you want to replicate this, you may have to fiddle a bit with nrow and
othersize to find the threshold on your system; at least I did.)
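For scale, here is a quick back-of-envelope on how much data a whole-record
scan would touch (assuming the 4-byte default itemsize of IntCol):

```python
# Total table size if reading column "i" forces touching whole records,
# assuming 4-byte integers (the IntCol default itemsize).
def table_bytes(nrow, othersize, itemsize=4):
    # each record holds othersize ints of otherdata plus one int for i
    return nrow * (othersize + 1) * itemsize

print(table_bytes(50000, 2000))  # 400200000, i.e. ~400 MB
print(table_bytes(50000, 4000))  # 800200000, ~800 MB, matching the file size above
```

Note that a pure whole-record scan would predict only a doubling of the time
when othersize doubles, so the forty-fold jump hints that some threshold
effect kicks in as well.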
Thank you in advance for any advice.
Best regards,
Jon Olav Vik
[1]
== Profile output by cumulative time (excerpt) ==
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000  108.265  108.265  <string>:1(<module>)
     1    0.029    0.029  108.265  108.265  {execfile}
     1    0.277    0.277  108.236  108.236  prof_pt.py:1(<module>)
     1    5.746    5.746   58.949   58.949  __init__.py:53(<module>)
     1    0.000    0.000   48.686   48.686  table.py:2822(__getitem__)
     1    0.000    0.000   48.685   48.685  table.py:1439(_read)
     1    0.000    0.000   48.685   48.685  table.py:1496(read)
     1   48.565   48.565   48.685   48.685  {method '_fillCol' of 'tableExtension.Row' objects}
     1    3.954    3.954   29.907   29.907  __init__.py:63(<module>)
     2    2.413    1.207   18.816    9.408  __init__.py:1(<module>)
     1    9.371    9.371   17.339   17.339  file.py:35(<module>)
     5    4.795    0.959   16.513    3.303  __init__.py:2(<module>)
     1    0.022    0.022   15.749   15.749  add_newdocs.py:9(<module>)
     1    0.486    0.486   10.239   10.239  type_check.py:3(<module>)
== Profile output by total time (excerpt) ==
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1     48.6   48.565   48.685   48.685  {method '_fillCol' of 'tableExtension.Row' objects}
     1      9.4    9.371   17.339   17.339  file.py:35(<module>)
     1      5.7    5.746   58.949   58.949  __init__.py:53(<module>)
     5      4.8    0.959   16.513    3.303  __init__.py:2(<module>)
     1      4.0    3.954   29.907   29.907  __init__.py:63(<module>)
     1      3.7    3.681    3.689    3.689  Numeric.py:85(<module>)
     2      2.6    1.291    2.589    1.294  __init__.py:8(<module>)
     2      2.4    1.207   18.816    9.408  __init__.py:1(<module>)
     1      2.0    1.983    1.983    1.983  weakref.py:6(<module>)
     1      1.9    1.939    1.939    1.939  arrayprint.py:4(<module>)
     1      1.7    1.664    3.679    3.679  numpytest.py:1(<module>)
     1      1.5    1.528    3.469    3.469  numeric.py:1(<module>)
     1      1.5    1.450    1.450    1.450  linalg.py:10(<module>)
     1      1.4    1.424    2.836    2.836  urllib2.py:74(<module>)
     1      1.3    1.312    2.780    2.780  utils.py:1(<module>)
     1      1.2    1.202    1.204    1.204  expressions.py:1(<module>)
     1      1.1    1.147    1.150    1.150  group.py:30(<module>)
     1      1.1    1.111    4.801    4.801  flavor.py:43(<module>)
     1      1.0    1.027    1.115    1.115  httplib.py:67(<module>)
     1      1.0    1.025    1.717    1.717  utils.py:3(<module>)
[2]
== profcols.py ==
from __future__ import with_statement
import tables as pt
import numpy as np
import os
import time
import logging

logging.basicConfig(level=logging.INFO)  # change to logging.DEBUG for details

filename = "profcols.h5"
chunksize = 25000

def descr(othersize):
    class result(pt.IsDescription):
        otherdata = pt.IntCol(shape=othersize)
        i = pt.IntCol()
    return result

def createfile(nrow, othersize):
    if os.path.exists(filename):
        os.remove(filename)
    with pt.openFile(filename, "w") as f:
        t = f.createTable(f.root, "t", descr(othersize), expectedrows=nrow)
        d, m = divmod(nrow, chunksize)
        for i in range(d):
            logging.debug("Appending %s records" % chunksize)
            t.append(np.ones(chunksize, t._v_dtype))
        logging.debug("Appending %s records" % m)
        t.append(np.ones(m, t._v_dtype))

def timecols(nrow=None, othersize=None):
    if nrow:
        createfile(nrow, othersize)
    logging.debug("Accessing column (nrow,othersize=%s,%s)" % (nrow, othersize))
    t0 = time.time()
    with pt.openFile(filename) as f:
        i = f.root.t.cols.i[:]
    t1 = time.time()
    logging.info("%s seconds, (nrow,othersize=%s,%s)"
                 % (t1 - t0, nrow, othersize))
    return t1 - t0

print timecols(50000, 2000)
sec = [[timecols(nrow, othersize) for nrow in range(50000, 60000, 2000)]
       for othersize in range(2000, 2500, 200)]
print sec
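To picture the access pattern the skip-the-remainder hypothesis implies, here
is a plain-numpy sketch of the same record layout (an illustration only, not
what PyTables does internally):

```python
import numpy as np

othersize = 2000
# In-memory mirror of the table description above.
dt = np.dtype([("otherdata", np.int32, (othersize,)), ("i", np.int32)])
a = np.zeros(100, dt)
a["i"] = np.arange(100)

col = a["i"]           # a strided view: no bytes copied yet
print(col.strides[0])  # 8004: the wanted 4-byte ints sit 8004 bytes apart
i = col.copy()         # materializing the column walks the whole buffer
print(i[:3])           # [0 1 2]
```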
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users