Thanks again for providing PyTables, which is a big help in my research.
I am now dealing with files of tens of gigabytes which would probably be
unthinkable to manage without PyTables. But using such large files
brings some performance issues.
I have a 26 GiB PyTables file which has a hierarchy that looks somewhat
like this:
root
. 256 Groups
.. approx. 64 Arrays (16328 arrays total)
... 2000 x 105 float64s
Before any caching takes place, File.walkNodes(classname="Array") is
very slow, taking more than six minutes to yield the first result. This
is despite the fact that I can get a result from File.getNode() or
File.walkNodes(group, classname="Array") quickly. I wrote a quick script
to benchmark this. The script and cProfile results are below.
If I may hazard a guess before delving into the source too much, it
appears that File.walkNodes() is slow because it iterates breadth-first:
it loads each of the top-level groups into memory to check whether they
are Arrays before moving on to nodes lower in the hierarchy. Is this
right?
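To make the guess concrete, here is a toy model, not PyTables code: the hierarchy is a nested dict, and "loading" a group just means listing its children. The function name and the dict layout are made up for illustration, and the real walkNodes() may well behave differently internally. It only shows why a breadth-first order would touch every top-level group before reaching the first leaf.

```python
from collections import deque

def loads_before_first_leaf(tree, depth_first):
    """Count simulated group loads performed before the walk reaches a leaf.

    `tree` is a nested dict: dict values are groups, anything else is a leaf.
    """
    pending = deque([tree])
    loads = 0
    while pending:
        # Depth-first takes the newest pending node, breadth-first the oldest.
        node = pending.pop() if depth_first else pending.popleft()
        if not isinstance(node, dict):
            return loads          # first leaf reached
        loads += 1                # examining a group lists its children
        pending.extend(node.values())
    return loads

# Toy hierarchy shaped like the file described above: 256 groups x 64 leaves.
toy = {"g%03d" % i: {"a%02d" % j: None for j in range(64)} for i in range(256)}

print(loads_before_first_leaf(toy, depth_first=False))  # 257
print(loads_before_first_leaf(toy, depth_first=True))   # 2
```

In this model the breadth-first walk lists the root plus all 256 groups before it sees a leaf, which is at least the same order of magnitude as the 254 _g_listGroup calls in root.prof below.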
If so, is there a way to speed this up? It seems to me that it shouldn't
be necessary to load all of these Groups into memory as you know they
are not Arrays and therefore don't match the classname I specified.
Could some additional lazy loading be added?
Additionally, what would you think about allowing the user to specify a
depth-first iteration in the walk functions? That is how I will work
around the problem unless you think a fix to this issue will be
forthcoming very soon.
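A minimal sketch of the depth-first workaround, written against a generic tree so that it does not depend on any particular PyTables version. The name walk_depth_first and its three callback parameters are hypothetical, not part of the PyTables API. With a real file, is_group and match could presumably be isinstance() checks against tables.Group and tables.Array, and children would return a group's child nodes; I have not timed that against the big file.

```python
def walk_depth_first(node, is_group, children, match):
    """Yield matching leaves below `node`, descending before moving sideways."""
    stack = [node]
    while stack:
        current = stack.pop()
        if is_group(current):
            # A group's children are listed only when the group is popped,
            # so sibling groups still on the stack are not touched yet.
            stack.extend(reversed(list(children(current))))
        elif match(current):
            yield current

# Toy usage: dicts stand in for groups, strings for arrays.
listed = []

def toy_children(group):
    listed.append(group)  # record which groups had to be listed
    return [group[name] for name in sorted(group)]

toy = {"_00": {"x": "leaf_00_x", "y": "leaf_00_y"}, "_01": {"z": "leaf_01_z"}}
walker = walk_depth_first(toy, lambda n: isinstance(n, dict),
                          toy_children, lambda n: True)
first = next(walker)      # only the root and "_00" have been listed so far
rest = list(walker)
```

Because it is a generator, the first result is available after listing only the groups on the path to it, which is the behaviour I am after.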
====
The script:
"""
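import cProfile, tables
big = tables.openFile("big.h5")
# Fetch a single Array by its parent group and name.
cProfile.run('big.getNode("/_00", "ENST00000260061")', "leaf.prof")
# First Array yielded when the walk starts at a single top-level group.
cProfile.run('big.walkNodes("/_38", classname="Array").next()', "group.prof")
# First Array yielded when the walk starts at the root (the slow case).
cProfile.run('big.walkNodes("/", classname="Array").next()', "root.prof")
big.close()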
"""
The results:
"""
$ for FILE in *.prof; do python -c "import pstats; pstats.Stats(\"$FILE\").strip_dirs().sort_stats('time').print_stats(5)"; done
Sun Sep 23 22:47:11 2007    group.prof
         792 function calls (781 primitive calls) in 1.641 CPU seconds
   Ordered by: internal time
   List reduced from 107 to 5 due to restriction <5>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.636    1.636    1.636    1.636 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
        1    0.001    0.001    1.638    1.638 group.py:389(_g_addChildrenNames)
       56    0.000    0.000    0.001    0.000 file.py:563(_ptNameFromH5Name)
      112    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)
       56    0.000    0.000    0.000    0.000 path.py:166(isVisibleName)
Sun Sep 23 22:47:09 2007    leaf.prof
         676 function calls (665 primitive calls) in 1.307 CPU seconds
   Ordered by: internal time
   List reduced from 95 to 5 due to restriction <5>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.303    1.303    1.303    1.303 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
        1    0.001    0.001    1.304    1.304 group.py:389(_g_addChildrenNames)
       24    0.001    0.000    0.001    0.000 group.py:883(__setattr__)
       46    0.000    0.000    0.000    0.000 file.py:563(_ptNameFromH5Name)
       92    0.000    0.000    0.000    0.000 proxydict.py:33(__setitem__)
Sun Sep 23 22:53:25 2007    root.prof
         165265 function calls (164492 primitive calls) in 374.321 CPU seconds
   Ordered by: internal time
   List reduced from 124 to 5 due to restriction <5>
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      254  373.471    1.470  373.471    1.470 {method '_g_listGroup' of 'hdf5Extension.Group' objects}
      254    0.187    0.001  374.042    1.473 group.py:389(_g_addChildrenNames)
    16226    0.098    0.000    0.157    0.000 file.py:563(_ptNameFromH5Name)
    32452    0.095    0.000    0.095    0.000 proxydict.py:33(__setitem__)
    16226    0.059    0.000    0.099    0.000 path.py:166(isVisibleName)
"""