Hi,

I am working on an open source project to write a search engine / data-mining framework of sorts for Wikipedia, and one of the first things I need to do is parse the Wikipedia dumps into an SQL database. I am using SQLAlchemy to do this, but it is very slow (at the current rate, 130 days!). I am sure I am doing something wrong since I am new at this, and am wondering whether any SQLAlchemy veterans can offer their insights.

The code can be found here:
http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py

The critical part of the code is this:

    for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
        print 'LINK from:', repr(self.title), 'to', \
            repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
        try:
            link_dest = session.query(Article).filter_by(title=link_dest_title).one()
        except sqlalchemy.orm.exc.NoResultFound:
            link_dest = None
        print link_dest
        session.add(Link(self, link_label, link_dest, dest_frag))

Basically, this parses the links in a page, looks each link up in the DB to resolve the reference, and then inserts a Link into the DB. The problem is that the "articles" table has over 7 million rows, and there are maybe 50 million links. I have tried both SQLite and Postgres as the database. Postgres EXPLAIN ANALYZE claims that the above statements should take only around 25 ms! I think I am doing something wrong with SQLAlchemy; maybe I am creating too many objects?

Any help would be much appreciated.
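To make the access pattern concrete: for every parsed link there is one SELECT against the 7-million-row articles table (returning None when the title is missing), followed by one pending INSERT. A minimal self-contained sketch of that pattern, with a plain dict standing in for the articles table and a list for the pending Link rows (all names here are hypothetical, not from the actual project):

```python
# Hypothetical, simplified model of the loop: a dict plays the role of
# the "articles" table (title -> primary key), a list collects link rows.
articles = {"Python": 1, "Wiki": 2}

def resolve_links(parsed_links, articles):
    """For each (label, dest_title, frag) tuple, resolve the destination
    article's id (None when the title does not exist, mirroring the
    NoResultFound branch) and collect a link row."""
    links = []
    for label, dest_title, frag in parsed_links:
        dest_id = articles.get(dest_title)  # one lookup per link
        links.append((label, dest_id, frag))
    return links

parsed = [("see Python", "Python", ""), ("dead link", "Nowhere", "top")]
rows = resolve_links(parsed, articles)
# rows == [("see Python", 1, ""), ("dead link", None, "top")]
```

With ~50 million links, this means ~50 million round-trips, which is where I suspect the time is going even though each individual query is fast.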