przemol...@poczta.fm, 10.08.2011 15:31:
On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
On Wed, Aug 10, 2011 at 12:17 PM, <przemol...@poczta.fm> wrote:
I'd like to write a Python (2.6/2.7) script which connects to a database, fetches
hundreds of thousands of rows, concatenates them (basically: creates XML)
and then puts the result into another table. Do I have any choice
regarding string concatenation in Python from a performance point of view?
Since the number of rows is big, I'd like to use the fastest possible library
(if there is any choice). Can you recommend something?

First off, I have no idea why you would want to create an XML dump of
hundreds of thousands of rows, only to store it in another table.
However, if that is your intention, list joining is about as efficient
as you're going to get in Python:

lst=["asdf","qwer","zxcv"] # feel free to add 399,997 more list entries
xml="<foo>"+"</foo><foo>".join(lst)+"</foo>"

This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>' which
may or may not be what you're after.

Since this process (XML building) currently runs inside the database (using native
SQL commands) and is a single-threaded task, it is quite slow. What I want to do is
spawn several Python subprocesses in parallel, each concatenating a subset of the
whole table (and then merge all of them at the end).
Basically:
- fetch all rows from the database (up to 1 million): what is the recommended data
type?
- spawn X Python processes, each one:
     - concatenating its own subset
- merge the results from all the subprocesses

This task is running on a server which has many but slow cores, and I am trying
to divide this task into many subtasks.

Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL) come with built-in Python support.

You still haven't provided enough information for me to understand why you need XML between one database and another (or the same one?). But if you go that route, you can read the data through multiple connections in multiple threads (or processes), have each build up one (or more) XML entries, and push those into a queue. Another thread (or more than one) can then read from that queue and write the XML items into a file (or another database) as they come in.
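
Roughly, and only as a sketch of that queue layout: the generated rows below stand in for your real cursor loop, and names like produce()/consume() and "dump.xml" are just placeholders.

from multiprocessing import Process, Queue

def produce(id_range, queue):
    # In the real script this would open its own DB connection and run a
    # SELECT restricted to this worker's id range; the loop below just
    # fakes a cursor with generated rows so the sketch runs on its own.
    start, stop = id_range
    for row_id in xrange(start, stop + 1):
        payload = "row-%d" % row_id            # stand-in for the column value
        queue.put("<foo>%s</foo>" % payload)
    queue.put(None)                            # sentinel: this producer is done

def consume(queue, producer_count, outfile):
    # A single writer drains the queue and streams the XML straight to disk,
    # so the complete document never has to fit into memory at once.
    finished = 0
    with open(outfile, "w") as f:
        while finished < producer_count:
            item = queue.get()
            if item is None:
                finished += 1
            else:
                f.write(item)

if __name__ == "__main__":
    queue = Queue()
    ranges = [(1, 1000), (1001, 2000), (2001, 3000), (3001, 4000)]
    workers = [Process(target=produce, args=(r, queue)) for r in ranges]
    for w in workers:
        w.start()
    consume(queue, len(workers), "dump.xml")
    for w in workers:
        w.join()

With a real database you would open one connection per producer and restrict each SELECT to that worker's id range; keeping a single writer makes the output ordering and file handling trivial.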

If your data is of considerable size, I wouldn't use string concatenation or joining at all (note that it requires 2x the memory during concatenation), but rather write it into a file, or even just process the data on the fly, i.e. write it back into the target table right away. Reading a file back in after the fact is much more resource-friendly than keeping huge amounts of data in memory. And disk speed is usually not a problem when streaming data from disk into a database.
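
A minimal sketch of that streaming variant, again with generated rows standing in for whatever your cursor yields:

def write_xml(rows, outfile):
    # Write each fragment as soon as it is built instead of collecting
    # everything into one big string first.
    with open(outfile, "w") as f:
        f.write("<rows>")
        for payload in rows:
            f.write("<foo>%s</foo>" % payload)
        f.write("</rows>")

# fake data; replace with a loop over your real query results
write_xml(("row-%d" % i for i in xrange(1000000)), "dump.xml")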

It may also be worth considering writing out literal SQL INSERT statements instead of XML. Most databases have a decent bulk upload tool.
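
Something along these lines would do; the target table and column names are made up here, and the quoting is deliberately naive:

def write_inserts(rows, outfile):
    # One INSERT statement per row, written to a plain SQL file that the
    # database's command line client or bulk loader can replay later.
    with open(outfile, "w") as f:
        for row_id, payload in rows:
            safe = str(payload).replace("'", "''")   # naive quoting, sketch only
            f.write("INSERT INTO target_table (id, payload)"
                    " VALUES (%d, '%s');\n" % (row_id, safe))

# fake data again; feed it rows from your real cursor instead
write_inserts(((i, "row-%d" % i) for i in xrange(1000)), "inserts.sql")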

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
