On Wednesday 10 May 2006 12:55, Stephen Simmons wrote:
> Hi Francesc,
>
> Answers to your questions:
>
> (1) Why did I read data into a numarray array rather than
> Table.read(start:stop)?
>
> Actually my first efforts used something similar to this, trying to read
> a 4M-row table in as 16 chunks of 256k rows. On paper it seems like a
> good idea. But the reality under WinXP was very different: the first few
> reads were very slow, with subsequent reads accelerating up to 4 times
> faster, presumably as the disk cache got better at reading ahead.
> Smaller buffer sizes of 16k rows achieved an average I/O throughput
> around 3 times faster. (Linux's disk caching may be better, so this may
> not matter so much to other users.)
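[For reference, the kind of chunked scan being described would look
something like this sketch; 'data.h5' and '/mytable' are made-up names,
and only the public Table.read(start, stop) API is used:

    import tables

    CHUNKSIZE = 256 * 1024                    # 256k rows per chunk, as in the test above

    fileh = tables.openFile('data.h5', 'r')   # hypothetical file
    table = fileh.root.mytable                # hypothetical table
    for start in xrange(0, table.nrows, CHUNKSIZE):
        stop = min(start + CHUNKSIZE, table.nrows)
        chunk = table.read(start, stop)       # one fresh RecArray per chunk
        # ... process chunk ...
    fileh.close()
]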
>
> But while small buffers won on I/O speed, they carried a performance
> penalty during data processing, due to spending more time in slow Python
> code rather than in numarray's C-speed ufuncs. Now my application is a
> little extreme, as I am doing arithmetic operations on 6,000,000,000
> fields [61 fields x 4.1M rows/month x 24 months]. Any microseconds
> wasted on unnecessary Python function calls, attribute lookups, data
> copying, allocating and deallocating memory, etc. quickly added up.
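[To make the tradeoff concrete, per-chunk processing at ufunc speed looks
roughly like the sketch below; 'balance' and 'rate' are hypothetical
field names:

    import numarray

    def process_chunk(chunk, acc):
        # One ufunc call per column per chunk, instead of one Python-level
        # operation per row.
        balance = chunk.field('balance')            # whole column, no Python loop
        rate = chunk.field('rate')
        return acc + numarray.sum(balance * rate)   # C-speed multiply and reduce
]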
>
> So I used the Table extension method _read_records() in preference to
> Table.read() for three reasons:
> (i) it made no difference to the code in the iterbuffer() iterator, but
> (ii) it avoided the setup overhead of Table.read() on each call, and
> (iii) I could pass in a numarray buffer without modifying PyTables core
> code (see the sketch after this list).
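[A rough sketch of that iterator, reusing one preallocated buffer. Note
that _read_records() is a private extension method, so this relies on
internals; the signature is the one used later in this message:

    def iterbuffer(table, buflen=16*1024):
        # Read the first chunk through the public API to get a buffer of
        # exactly the right record type, then refill it in place for every
        # later chunk, avoiding per-call allocation.
        buf = table.read(0, min(buflen, table.nrows))
        yield buf
        for start in xrange(buflen, table.nrows, buflen):
            n = min(buflen, table.nrows - start)
            table._read_records(start, n, buf)   # fill buf, no new allocation
            yield buf[:n]
]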
I've studied your code in more depth and I've detected a couple of
issues:
1.- You have created a regular RecArray, which does not support nested
records. For a general solution to be integrated into PyTables, you
would have to create a NestedRecArray.
2.- More importantly, you need to create a RecArray of the same length
as the table. For small tables (i.e. those that fit in memory), this
is not a problem. But what happens in the more general case? A malloc
error. So this is not acceptable for the general case.
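To put a number on it: assuming 8-byte fields, a full-table buffer for a
single month of your data would need roughly 61 fields x 4.1M rows x 8
bytes, which is about 2 GB, more than a 32-bit process on WinXP can
allocate in one malloc.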
But, wait, I think that your case could be greatly alleviated by
introducing a new public method like:
def readRange(self, start, stop, buffer):
    """Read a table range and put it in a heterogeneous buffer object.

    No checks are made on the start and stop indices, nor on whether
    the buffer is adequate to hold the data. Use this method when you
    want extreme read speed, and at your own risk!
    """
    self._read_records(start, stop - start, buffer)
and you can create a buffer that fits your own needs (i.e. it doesn't
have to be a NestedRecArray if a regular RecArray is enough for you)
and pass it in.
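Used from your side, it would look something like this sketch (full
chunks only, and remember readRange() is still just a proposal):

    buflen = 16 * 1024
    buf = table.read(0, min(buflen, table.nrows))     # correctly typed, reusable buffer
    for start in xrange(0, table.nrows - buflen + 1, buflen):
        table.readRange(start, start + buflen, buf)   # refill buf in place
        # ... crunch buf with numarray ufuncs ...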
Do other people think that such a readRange() would be useful for them?
In that case, and *if* benchmarks show that it has a real advantage
over the regular .read(), I'm open to including it.
Cheers!
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"