Francesc Altet wrote:
> On Tuesday 14 November 2006 at 23:08, Erin Sheldon wrote:
>
>> On 11/14/06, John Hunter <[EMAIL PROTECTED]> wrote:
>>
>>> Has anyone written any code to facilitate dumping mysql query results
>>> (mainly arrays of floats) into numpy arrays directly at the extension
>>> code layer? The query results->list->array conversion can be slow.
>>>
>>> Ideally, one could do this semi-automagically with record arrays and
>>> table introspection....
>>
>> I've been considering this as well. I use both postgres and Oracle
>> in my work, and I have been using the python interfaces (cx_Oracle
>> and pgdb) to get result lists and convert to numpy arrays.
>>
>> The question I have been asking myself is "what is the advantage
>> of such an approach?". It would be faster, but by how much?
>> Presumably the bottleneck for most applications will be data
>> retrieval rather than data copying in memory.
>
> Well, that largely depends on your pattern of access to the data in
> your database. If you are accessing regions of your database with a
> high degree of spatial locality (i.e. they are located in the same or
> very similar places), the data is most probably already in memory (in
> your filesystem cache or maybe in your database cache) and the
> bottleneck will be memory access. Of course, if you don't have such
> spatial locality in the access pattern, then the bottleneck will be
> the disk.
>
> Just to see how DB 2.0 could benefit from adopting record arrays as
> input buffers, I've done a comparison between SQLite3 and PyTables.
> PyTables doesn't support DB 2.0 as such, but it does use record arrays
> as buffers internally so as to read data in an efficient way (there
> may be other databases that feature this, but I know PyTables best ;)
>
> For this, I've used a modified version of a small benchmarking program
> posted by Tim Hochberg in this same thread (it is listed at the end
> of the message).
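For readers following along, the two conversion paths John describes can be sketched side by side. This is a minimal sketch, not code from the thread: it uses the stdlib sqlite3 module in place of MySQL, and the table/column names (`demo`, `x`, `y`) are illustrative.

```python
import sqlite3
import numpy as np

# A tiny in-memory table standing in for the MySQL data
# (sqlite3 is used here only because it ships with Python).
conn = sqlite3.connect(":memory:")
conn.execute("create table demo (x real, y real)")
conn.executemany("insert into demo values (?, ?)",
                 [(float(i), float(i) * 2.0) for i in range(1000)])

# The slow path under discussion: query results -> list -> array.
rows = conn.execute("select x, y from demo").fetchall()
arr = np.array(rows, dtype=np.float64)   # extra copy via a Python list

# The more direct path: feed the cursor straight to np.fromiter with a
# record-array dtype, skipping the intermediate list of tuples.
dtype = np.dtype([("x", np.float64), ("y", np.float64)])
rec = np.fromiter(conn.execute("select x, y from demo"), dtype=dtype)
```

Both paths still cross the Python-object layer row by row; the extension-level approach John asks about would avoid even that.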
> Here are the results:
>
>     setup SQLite took 23.5661110878 seconds
>     retrieve SQLite took 3.26717996597 seconds
>     setup PyTables took 0.139157056808 seconds
>     retrieve PyTables took 0.13444685936 seconds
>
> [SQLite results were obtained using an in-memory database, while
> PyTables used an on-disk one. See the code.]
>
> So, yes, if your access pattern exhibits a high degree of locality,
> you can expect a huge difference in reading speed (more than 20x for
> this example, but as this depends on the dataset size, it can be even
> higher for larger datasets).

One weakness of this benchmark is that it doesn't break out how much of the sqlite3 overhead is inherent to the sqlite3 engine, which I expect is somewhat more complicated internally than PyTables, and how much is due to all the extra layers we go through to get the data into an array (native [in database] -> Python objects -> native [in record array]). To try to get at least a little handle on this, I added this test:
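Tim's original benchmarking program is not reproduced in this message, but the SQLite half of such a benchmark might look roughly like the sketch below (an assumption, not the thread's actual code: in-memory database, `demo`/`x`/`y` names taken from the snippets in the thread, and a smaller row count for speed).

```python
import sqlite3
import time
import numpy as np

N = 100_000
dtype = np.dtype([("x", np.float64), ("y", np.float64)])

# Setup: create and populate an in-memory SQLite table.
t0 = time.time()
conn = sqlite3.connect(":memory:")
conn.execute("create table demo (x real, y real)")
conn.executemany("insert into demo values (?, ?)",
                 ((float(i), float(i)) for i in range(N)))
setup = time.time() - t0

# Retrieve: pull every row back out into a numpy record array.
t0 = time.time()
data = np.fromiter(conn.execute("select * from demo"), dtype=dtype)
retrieve = time.time() - t0

print(f"setup SQLite took {setup} seconds")
print(f"retrieve SQLite took {retrieve} seconds")
```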
    def querySQLite(conn):
        c = conn.cursor()
        c.execute('select * from demo where x = 0.0')
        y = np.fromiter(c, dtype=dtype)
        return y

This returns very little data (in the cases I ran, it actually returned no data). However, it still needs to loop over all the records and examine them. Here's what the timings looked like:

    setup SQLite took 9.71799993515 seconds
    retrieve SQLite took 0.921999931335 seconds
    query SQLite took 0.313000202179 seconds

I'm reluctant to conclude that 1/3 of the time is spent traversing the database and 2/3 of the time in creating the data, solely because databases are big voodoo to me. Still, we can probably conclude that traversing the data itself is pretty expensive, and we would be unlikely to approach PyTables speed even if we didn't have the extra overhead. On the other hand, there's a factor of three or so improvement that could be realized by reducing overhead. Or maybe not. I think that the database has to return its data a row at a time, so there's intrinsically a lot of copying that's going to happen. So, I think it's unclear whether getting the data directly in native format would be significantly cheaper. I suppose that the way to definitively test it would be to rewrite one of these tests in C. Any volunteers?

I think it's probably safe to say that either way PyTables will cream sqlite3 in those fields where it's applicable. One of these days I really need to dig into PyTables. I'm sure I could use it for something.

[snip]

-tim
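The traversal-versus-conversion split Tim probes with querySQLite can be reproduced in miniature: time a full retrieval against a query that matches nothing, so the second number approximates pure row-traversal cost. This is a self-contained sketch (assumed names `demo`/`x`/`y` as in the snippet above; x values start at 1 so `x = 0.0` matches no rows).

```python
import sqlite3
import time
import numpy as np

N = 100_000
dtype = np.dtype([("x", np.float64), ("y", np.float64)])

conn = sqlite3.connect(":memory:")
conn.execute("create table demo (x real, y real)")
conn.executemany("insert into demo values (?, ?)",
                 ((float(i + 1), float(i)) for i in range(N)))

# Full retrieval: traverse every row AND build the record array.
t0 = time.time()
full = np.fromiter(conn.execute("select * from demo"), dtype=dtype)
t_retrieve = time.time() - t0

# Empty-result query: every row is still examined, but nothing comes
# back, so this approximates the pure traversal cost.
t0 = time.time()
empty = np.fromiter(conn.execute("select * from demo where x = 0.0"),
                    dtype=dtype)
t_query = time.time() - t0

print(f"retrieve took {t_retrieve}s, query took {t_query}s")
```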
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion