Just a couple of thoughts that might help you out:
1) I would profile the code. Running a regular expression over an entire
Wikipedia article looks to me like a VERY expensive operation (see the
profiling sketch after my signature).

2) Did the first pass succeed, and how long did it take?

3) Taking a quick look at
http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py, it seems to
me that the second pass through the data creates a second set of Article
objects that are never saved to the database. The 'self' reference in:

    session.add(Link(self, link_label, link_dest, dest_frag))

would therefore point at an object that is never saved. I guess this does
not matter, since the id field is correct (you set it explicitly), but it
might be better (and faster) to read back through the articles table for
pass two instead of re-parsing the XML. Something like:

    # delete the data from the previous run
    redirects_table.drop(bind=engine)
    redirects_table.create(bind=engine)
    links_table.drop(bind=engine)
    links_table.create(bind=engine)

    # re-parse from the rows already stored in the articles table
    # instead of reading the XML dump a second time
    for article in session.query(Article):
        article.parse_text(session)

It's pretty late, so I may have missed something. Hope the above helps.

Shawn Church
I/S Consultant
shawn at SChurchComputers.com
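P.S. For (1), here is a minimal profiling sketch (untested; it assumes an
'article' object and a 'session' are already in scope, as in your script,
so adjust the profiled statement to your actual call site):

    import cProfile
    import pstats

    # Profile a single parse to see where the time goes (regex
    # matching vs. ORM round trips); stats land in 'parse.prof'.
    cProfile.runctx('article.parse_text(session)', globals(), locals(),
                    'parse.prof')

    # Show the 20 most expensive calls by cumulative time.
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)

That should make it clear whether the regex pass or the database work
dominates before you spend time optimizing either.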