On Tuesday 12 May 2009 14:00:41 Armando Serrano Lombillo wrote:
> Ok, it looks like we were writing similar emails at the same time. :)
>
> I'll change my code right away, but I'm still interested in what exactly
> was slowing my first approach. Was it the way I accessed the file, that is,
> is t.colinstances[ind] slow? Or was it that directly building the set is
> slower than using .add()? The difference is huge, as my impressions and
> your benchmarks showed.

That's a good question. As I was not certain about what was happening there,
I've done some profiling. Here are the routines that consumed the most time
in your first method:

Tue May 12 14:07:25 2009    tuniq1.prof

         2401085 function calls (2401062 primitive calls) in 5.835 CPU seconds

   Ordered by: internal time, call count
   List reduced from 184 to 20 due to restriction <20>

         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
          50000    2.788    0.000    3.092    0.000 {method '_fillCol' of 'tables.tableExtension.Row' objects}
          50000    0.442    0.000    3.569    0.000 table.py:1496(_read)
         100000    0.313    0.000    0.861    0.000 leaf.py:425(_processRange)
  150030/150010    0.253    0.000    0.491    0.000 file.py:880(_getNode)
          50005    0.241    0.000    5.759    0.000 table.py:2914(__getitem__)
         150025    0.220    0.000    0.236    0.000 file.py:249(__getitem__)
          50000    0.209    0.000    4.822    0.000 table.py:1553(read)
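
In case it helps to reproduce a listing like the one above, here is a minimal
sketch using only the standard library's cProfile and pstats modules. The
helper name profile_top and the workload function are made up for this
example; they are not part of PyTables, and the real benchmark script is not
shown here. Sorting by "time" and then "calls" should give the same
"internal time, call count" ordering as in the report.

```python
import cProfile
import io
import pstats


def profile_top(func, *args, limit=20):
    """Run func(*args) under cProfile; return (result, report text).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by internal time first, then by call count, and keep
    # at most `limit` entries in the printed listing.
    stats.sort_stats("time", "calls").print_stats(limit)
    return result, buffer.getvalue()


def workload(n):
    """Stand-in workload: count the unique remainders modulo 7."""
    return len({i % 7 for i in range(n)})


result, report = profile_top(workload, 100000)
print(report)
```

To profile the real code, pass the function that builds the sets instead of
workload.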

It is clear now that a `Table.__getitem__()` call was issued for every
*single* item in the table (50005 calls, accounting for 5.759 of the 5.835
seconds in cumulative time). As this is a user-accessible method, it has to do
a lot of checks first in order to ensure that the user is requesting a valid
item, and this adds a lot of overhead.

In comparison, the second method uses a table iterator, which is implemented
as an extension (i.e. it is fast) and, besides, only performs its checks at
the beginning. Also, by using the iterator you only have to read each item
once per run, instead of once per existing column (remember that tables are
stored row-wise, and you were accessing items column-wise in method1).
Finally, table iterators always do buffered I/O, so they read data ahead and
re-use it in the following iterations. All in all, this approach is much
faster.
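
To make the difference concrete, here is a toy model in pure Python. It is
not PyTables code: ToyTable, get_item and the two uniques_* functions are
invented for this sketch, and the counters only stand in for the real costs
(argument checks and row reads). It assumes the goal is a set of unique values
per column, as in your script.

```python
class ToyTable:
    """Row-wise toy table; a stand-in for illustration, not the PyTables API."""

    def __init__(self, rows, colnames):
        self.rows = rows          # list of tuples, one tuple per row
        self.colnames = colnames
        self.checks = 0           # number of argument validations performed
        self.row_reads = 0        # number of times a whole row was fetched

    def get_item(self, colname, index):
        """Checked access to one item: validates on every single call."""
        self.checks += 1
        if not 0 <= index < len(self.rows):
            raise IndexError("row index out of range")
        self.row_reads += 1       # the whole row is read to get one field
        return self.rows[index][self.colnames.index(colname)]

    def __iter__(self):
        """Row iterator: one check up front, each row fetched once."""
        self.checks += 1
        for row in self.rows:
            self.row_reads += 1
            yield dict(zip(self.colnames, row))


def uniques_by_item(table):
    """Column-wise: one checked item access per (column, row) pair."""
    return {
        name: {table.get_item(name, i) for i in range(len(table.rows))}
        for name in table.colnames
    }


def uniques_by_iterator(table):
    """Row-wise: a single pass that fills every column's set with .add()."""
    result = {name: set() for name in table.colnames}
    for row in table:
        for name in table.colnames:
            result[name].add(row[name])
    return result


rows = [(i % 3, i % 5) for i in range(100)]
by_item = ToyTable(rows, ["a", "b"])
by_iter = ToyTable(rows, ["a", "b"])
u1 = uniques_by_item(by_item)
u2 = uniques_by_iterator(by_iter)
print(by_item.checks, by_item.row_reads)
print(by_iter.checks, by_iter.row_reads)
```

Both functions return the same sets, but in this model the item-based version
performs one check and one row read per column and row, while the iterator
version checks once and reads each row once.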
The moral of this is: use table iterators whenever you can :)
Cheers,
--
Francesc Alted