Hi Francesc,
Answers to your questions:
(1) Why did I read data into a numarray array rather than use
Table.read(start, stop)?
Actually, my first efforts used something similar to this, trying to read
a 4m row database in as 16 chunks of 256k rows. On paper it seems like a
good idea, but the reality under WinXP was very different: the first few
reads were very slow, with subsequent reads accelerating up to 4 times
faster, presumably as the disk cache got better at reading ahead.
Smaller buffer sizes of 16k rows achieved an average I/O throughput
around 3 times faster. (Linux's disk caching may be better, so this may
not matter so much to other users.)
But while small buffers won on I/O speed, they carried a performance
penalty during data processing, due to spending more time in slow Python
code rather than numarray's C-speed ufuncs. Now my application is a
little extreme, as I am doing arithmetic operations on 6,000,000,000
fields [61 fields x 4.1m rows/month x 24 months]. Any microseconds
wasted on unnecessary Python function calls, attribute lookups, data
copying, and memory allocation and deallocation quickly added up.
So I used the Table extension method _read_records() in preference to
Table.read() for three reasons:
(i) it made no difference to the code in the iterbuffer() iterator, but
(ii) it avoided setup overhead in Table.read() on each call, and
(iii) I could pass in a numarray buffer without modifying PyTables core code.
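The pattern described above, allocating one buffer up front and refilling it on every chunk, can be sketched roughly as follows. Here numpy stands in for numarray, and read_into() is a hypothetical stand-in for the internal _read_records() call; neither is the real PyTables API.

```python
# Minimal sketch of the reusable-buffer chunked iterator, assuming:
#  - numpy as a stand-in for numarray,
#  - read_into(start, nrows, buf), a hypothetical callable that fills
#    buf[:nrows] with rows start..start+nrows (in PyTables this role is
#    played by the internal _read_records()).
import numpy as np

def iterbuffer(total_rows, buffer_rows, read_into):
    """Yield (start, nrows, buf) for each chunk of the table.

    The same buffer object is reused on every iteration, so there is no
    per-chunk allocation or deallocation.
    """
    buf = np.empty(buffer_rows, dtype=np.float64)  # one allocation, up front
    for start in range(0, total_rows, buffer_rows):
        nrows = min(buffer_rows, total_rows - start)
        read_into(start, nrows, buf)     # fill buf[:nrows] in place
        yield start, nrows, buf          # caller must process before next step
```

Because the buffer is recycled, each chunk has to be processed (or copied) before the next iteration overwrites it; that is the price paid for avoiding per-chunk allocation.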
(2) Use diff
I'm not a real programmer, just a credit card marketing manager who
lurks in Python newsgroups. But I've recently discovered SVN, so next
time I'll try to create patches with TortoiseSVN.
Francesc Altet wrote:
Hi Stephen,
Here are some patches Carabos might consider for PyTables 1.4. I've
developed these against PyTables 1.4alpha (20060505) to assist in
Thanks for your patches. Some questions:
Change #1: An iterator that returns a table's data into a numarray
Interesting approach. By the way, why did you not consider making
use of Table.read() and providing different values of start and stop?
If, for whatever reason, you need to use the same recarray buffer in
each iteration, maybe an optional buffer argument in read() would be
better. Do you think your approach has any advantage over this proposal?
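Francesc's alternative, repeated read(start, stop) calls, optionally with a reusable buffer, might look like the sketch below. The function name process_in_chunks and the buffer argument are illustrative only; the real Table.read() signature may differ, so treat this as pseudocode for the proposal.

```python
# Rough sketch of the read(start, stop) alternative, with the optional
# reusable buffer Francesc mentions. `read` and the `buf` keyword are
# hypothetical stand-ins, not the actual PyTables API.
def process_in_chunks(read, total_rows, chunk_rows, process, buf=None):
    """Call read(start, stop) (or read(start, stop, buf) when a buffer
    is supplied) for each chunk and hand the result to process()."""
    for start in range(0, total_rows, chunk_rows):
        stop = min(start + chunk_rows, total_rows)
        data = read(start, stop) if buf is None else read(start, stop, buf)
        process(data, stop - start)
```

Compared with the iterator approach, this keeps the chunking loop outside the table object, at the cost of a fresh result object per call unless a buffer argument is provided.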
Change #2: Read the HDF chunksize parameter when opening a file
Nice thing. Will apply it.
Change #3: Allow direct control over the HDF chunksize when creating a file
Also a good idea for users who want finer control over their
chunksizes. We will look into this and plan to offer control for it as
well.
Cheers!
Ps.- If you want to make the life of poor PyTables developers better,
please provide "diff -urN" patches next time (sorry, I've no idea how
to create these in Win. Anyone?).
_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users