On Wed, Aug 10, 2011 at 06:20:10PM +0200, Stefan Behnel wrote:
> przemol...@poczta.fm, 10.08.2011 15:31:
>> On Wed, Aug 10, 2011 at 01:32:06PM +0100, Chris Angelico wrote:
>>> On Wed, Aug 10, 2011 at 12:17 PM, <przemol...@poczta.fm> wrote:
>>>> I'd like to write a Python (2.6/2.7) script which connects to a
>>>> database, fetches hundreds of thousands of rows, concatenates them
>>>> (basically: creates XML) and then puts the result into another
>>>> table. Do I have any choice regarding string concatenation in
>>>> Python from the performance point of view? Since the number of
>>>> rows is big, I'd like to use the fastest possible library (if
>>>> there is any choice). Can you recommend something?
>>>
>>> First off, I have no idea why you would want to create an XML dump of
>>> hundreds of thousands of rows, only to store it in another table.
>>> However, if that is your intention, list joining is about as efficient
>>> as you're going to get in Python:
>>>
>>> lst = ["asdf", "qwer", "zxcv"]  # feel free to add 399,997 more list entries
>>> xml = "<foo>" + "</foo><foo>".join(lst) + "</foo>"
>>>
>>> This sets xml to '<foo>asdf</foo><foo>qwer</foo><foo>zxcv</foo>',
>>> which may or may not be what you're after.
>>
>> Since this process (XML building) currently runs inside the database
>> (using native SQL commands) and is a single-threaded task, it is quite
>> slow. What I wanted to do is spawn several Python subprocesses in
>> parallel, each of which concatenates a subset of the whole table, and
>> then merge all the results at the end. Basically:
>> - fetch all rows from the database (up to 1 million): what is the
>>   recommended data type?
>> - spawn X Python processes, each one concatenating its own subset
>> - merge the results from all the subprocesses
>>
>> This task runs on a server which has many but slow cores, and I am
>> trying to divide the work into many subtasks.
>
> Makes sense to me. Note that the really good DBMSes (namely, PostgreSQL)
> come with built-in Python support.

The data are in Oracle, so I have to use cx_Oracle.
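Something along these lines is what I have in mind -- just an untested
sketch, with made-up connection details and made-up table/column names
(src_table, payload, and row_num, which I assume here is a dense,
indexed ordering column):

import multiprocessing
from xml.sax.saxutils import escape

import cx_Oracle

DSN = "user/password@dbhost/service"   # made up
CHUNK = 50000                          # rows per subprocess, made up

def build_chunk(bounds):
    """Fetch one slice of the sorted result set and build its XML part."""
    lo, hi = bounds
    conn = cx_Oracle.connect(DSN)      # one connection per process;
    cur = conn.cursor()                # they cannot be shared across a fork
    cur.execute("SELECT payload FROM src_table "
                "WHERE row_num BETWEEN :lo AND :hi ORDER BY row_num",
                lo=lo, hi=hi)
    parts = ["<foo>%s</foo>" % escape("%s" % row[0]) for row in cur]
    conn.close()
    return "".join(parts)

if __name__ == "__main__":
    total = 1000000                    # up to a million rows
    bounds = [(i, min(i + CHUNK - 1, total))
              for i in range(1, total + 1, CHUNK)]
    pool = multiprocessing.Pool()      # defaults to one worker per core
    # Pool.map returns the results in the order of its inputs, so the
    # sorted order of the select survives the final join.
    xml = "".join(pool.map(build_chunk, bounds))

Since Pool.map hands the results back in input order, the "merge in the
correct order" step should reduce to that final join.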
> You still didn't provide enough information to make me understand why you
> need XML in between one database and another (or the same?), but if you
> go that route, you can just read data through multiple connections in
> multiple threads (or processes), have each build up one (or more) XML
> entries, and then push those into a queue. Then another thread (or more
> than one) can read from that queue and write the XML items into a file
> (or another database) as they come in.

I am not a database developer, so I don't want to change the whole process
of data flow between the applications in my company. Another process reads
this XML from a particular Oracle table, so I have to put the final XML
there.

> If your data has a considerable size, I wouldn't use string concatenation
> or joining at all (note that it requires 2x the memory during
> concatenation), but rather write it into a file, or even just process the
> data on the fly, i.e. write it back into the target table right away.
> Reading a file back in after the fact is much more resource friendly than
> keeping huge amounts of data in memory. And disk speed is usually not a
> problem when streaming data from disk into a database.

This server has 256 GB of RAM, so memory is not a problem. Also, the select
which fetches the data is sorted. That is why I have to carefully divide
the work into subtasks and then merge the results in the correct order.

Regards
Przemyslaw Bak (przemol)
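PS. Stefan, if I understand your queue idea correctly, it would look
roughly like this (again untested, same made-up names as above, plus a
made-up xml_parts staging table; each fragment carries its chunk number
so the pieces can still be put back in order):

import threading
import Queue  # the stdlib "Queue" module in Python 2

import cx_Oracle

DSN = "user/password@dbhost/service"   # made up
q = Queue.Queue(maxsize=64)            # bounded, so the readers cannot
                                       # outrun the writer

def reader(chunk_no, lo, hi):
    conn = cx_Oracle.connect(DSN)      # one connection per thread
    cur = conn.cursor()
    cur.execute("SELECT payload FROM src_table "
                "WHERE row_num BETWEEN :lo AND :hi ORDER BY row_num",
                lo=lo, hi=hi)
    xml = "".join("<foo>%s</foo>" % row[0] for row in cur)
    q.put((chunk_no, xml))
    conn.close()

def writer(n_chunks):
    conn = cx_Oracle.connect(DSN)
    cur = conn.cursor()
    for _ in range(n_chunks):
        chunk_no, xml = q.get()
        # store each fragment with its position; concatenating the rows
        # in chunk_no order afterwards yields the final document
        cur.execute("INSERT INTO xml_parts (chunk_no, part) "
                    "VALUES (:n, :p)", n=chunk_no, p=xml)
    conn.commit()
    conn.close()

chunks = [(0, 1, 50000), (1, 50001, 100000)]  # and so on
readers = [threading.Thread(target=reader, args=c) for c in chunks]
wt = threading.Thread(target=writer, args=(len(chunks),))
wt.start()
for t in readers:
    t.start()
for t in readers:
    t.join()
wt.join()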