Hi Francesc,
Answers to your questions:
(1) Why did I read data into a numarray array rather than use
Table.read(start, stop)?
Actually, my first efforts used something similar to this, trying to read
a 4m row database in as 16 chunks of 256k rows. On paper it seems like a
good idea, but the reality under WinXP was very different: the first few
reads were very slow, with subsequent reads accelerating up to 4 times
faster, presumably as the disk cache got better at reading ahead.
Smaller buffer sizes of 16k rows achieved an average I/O throughput
around 3 times faster. (Linux's disk caching may be better, so this may
not matter so much to other users.)
But while small buffers won on I/O speed, they carried a performance
penalty during data processing, due to spending more time in slow Python
code rather than numarray's C-speed ufuncs. Now my application is a
little extreme, as I am doing arithmetic operations on 6,000,000,000
fields [61 fields x 4.1m rows/month x 24 months]. Any microseconds
wasted on unnecessary Python function calls, attribute lookups, data
copying, and memory allocation and deallocation quickly added up.
So I used the Table extension method _read_records() in preference to
Table.read() for three reasons:
(i) it made no difference to the code in the iterbuffer() iterator, but
(ii) it avoided setup overhead in Table.read() on each call, and
(iii) I could pass in a numarray buffer without modifying PyTables core code.
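The pattern described above, allocating one buffer up front and refilling it on every chunk, can be sketched roughly as follows. Here numpy stands in for numarray, and read_into() is a hypothetical stand-in for the internal _read_records() call; neither is the real PyTables API.

```python
# Minimal sketch of the reusable-buffer chunked iterator, assuming:
#  - numpy as a stand-in for numarray,
#  - read_into(start, nrows, buf), a hypothetical callable that fills
#    buf[:nrows] with rows start..start+nrows (in PyTables this role is
#    played by the internal _read_records()).
import numpy as np

def iterbuffer(total_rows, buffer_rows, read_into):
    """Yield (start, nrows, buf) for each chunk of the table.

    The same buffer object is reused on every iteration, so there is no
    per-chunk allocation or deallocation.
    """
    buf = np.empty(buffer_rows, dtype=np.float64)  # one allocation, up front
    for start in range(0, total_rows, buffer_rows):
        nrows = min(buffer_rows, total_rows - start)
        read_into(start, nrows, buf)     # fill buf[:nrows] in place
        yield start, nrows, buf          # caller must process before next step
```

Because the buffer is recycled, each chunk has to be processed (or copied) before the next iteration overwrites it; that is the price paid for avoiding per-chunk allocation.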
(2) Use diff
I'm not a real programmer, just a credit card marketing manager who
lurks in Python newsgroups. But I've recently discovered SVN, so next
time I'll try to create patches with TortoiseSVN.
Francesc Altet wrote:
Hi Stephen,
Here are some patches Carabos might consider for PyTables 1.4. I've
developed these against PyTables 1.4alpha (20060505) to assist in
Thanks for your patches. Some questions:
Change #1: An iterator that returns a table's data into a numarray
Interesting approach. By the way, why did you not consider making
use of Table.read() and providing different values of start and stop?
If, for whatever reason, you need to use the same recarray buffer in
each iteration, maybe an optional buffer argument in read() would be
better. Do you think your approach has any advantage over this proposal?
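Francesc's alternative, repeated read(start, stop) calls, optionally with a reusable buffer, might look like the sketch below. The function name process_in_chunks and the buffer argument are illustrative only; the real Table.read() signature may differ, so treat this as pseudocode for the proposal.

```python
# Rough sketch of the read(start, stop) alternative, with the optional
# reusable buffer Francesc mentions. `read` and the `buf` keyword are
# hypothetical stand-ins, not the actual PyTables API.
def process_in_chunks(read, total_rows, chunk_rows, process, buf=None):
    """Call read(start, stop) (or read(start, stop, buf) when a buffer
    is supplied) for each chunk and hand the result to process()."""
    for start in range(0, total_rows, chunk_rows):
        stop = min(start + chunk_rows, total_rows)
        data = read(start, stop) if buf is None else read(start, stop, buf)
        process(data, stop - start)
```

Compared with the iterator approach, this keeps the chunking loop outside the table object, at the cost of a fresh result object per call unless a buffer argument is provided.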
Change #2: Read the HDF chunksize parameter when opening a file
Nice thing. Will apply it.
Change #3: Allow direct control over the HDF chunksize when creating a file
Also a good idea for users who want finer control over their
chunksizes. We will look into this and plan to offer control for it as
well.
Cheers!
Ps.- If you want to make the life of poor PyTables developers better,
please provide "diff -urN" patches next time (sorry, I've no idea how
to create these in Win. Anyone?).
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users