On Sep 22, 2008, at 11:20 PM, CodeIsMightier wrote:
> for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
>     print 'LINK from:', repr(self.title), 'to', \
>         repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
>     try:
>         link_dest = session.query(Article).filter_by(title=link_dest_title).one()
>     except sqlalchemy.orm.exc.NoResultFound:
>         link_dest = None
>     print link_dest
>     session.add(Link(self, link_label, link_dest, dest_frag))
>
> Basically what this does is parse the links in a page, look each one up
> in the DB to resolve the reference, and then insert a Link into the DB.
> The problem is that the "articles" table is over 7 million rows and
> there are maybe 50 million links.

commenting only on this snippet:

1. throwing/catching exceptions in Python is time consuming, so I would not rely upon one() to achieve this; say all() instead and check whether any rows are present.

2. if the point of session.add(Link(...)) is to add a new Link object, you *definitely* need to flush every 100 objects or so. When flush() proceeds, the Session has to sort its entire collection of dirty objects, so this takes increasingly longer as the set of "unflushed" objects grows - that set should be kept small.

3. the ORM's creation of objects is time consuming, so I would look at selecting individual columns instead of full objects, i.e. sess.query(Article.id).filter(...).all().

4. the ORM overall is designed to simplify a complex set of operations over a complex object graph, and is not optimized for "raw speed" as much as plain SQL expressions are. Since this is a very simple operation, you'll get much better performance using plain SQL expressions (see the tutorial at http://www.sqlalchemy.org/docs/05/sqlexpression.html ). I wouldn't use the ORM at all for this particular operation.

5. the whole pattern of "fetch a row; it doesn't exist; then insert" is itself slow. If you're using MySQL, consider the REPLACE construct, which removes the need to check for a row first.

6.
depending on the database in use, you probably want to commit your transaction every X number of rows - an exceedingly large number of changes in one transaction can also put stress on the operation.

7. if the full database of articles is 7 million rows, you might even try loading the full list of article IDs and titles into an in-memory dictionary (only the integer id and string title, *not* a full ORM object) and operating that way. It would use a few gigs of RAM but would only take a few minutes to try out, to see if it's feasible. Things like memcached and DBM might be of use here too.

--
You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/sqlalchemy?hl=en
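Putting points 4, 6, and 7 together, here is a minimal sketch of the suggested approach: a title-to-id dictionary built from one column-only SELECT, plain SQL inserts, and a commit every N rows. It uses the stdlib sqlite3 module purely for illustration (the same pattern applies through the SQLAlchemy expression language), and the schema, column names, and sample data are all made up for the example.

```python
import sqlite3

# Toy schema standing in for the real articles/links tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
    CREATE TABLE links (
        src_id INTEGER, label TEXT, dest_id INTEGER, dest_frag TEXT
    );
""")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Python",), ("SQL",), ("Wiki",)],
)
conn.commit()

# Point 7: load (title -> id) once into a plain dict instead of issuing
# one SELECT per link; only scalar columns, no ORM objects.
title_to_id = dict(
    (title, id_) for id_, title in conn.execute("SELECT id, title FROM articles")
)

# Stand-in for what parse_links() would yield: (label, dest_title, frag).
parsed = [
    ("see python", "Python", "intro"),
    ("see sql", "SQL", ""),
    ("dangling", "NoSuchPage", ""),  # unresolved reference -> dest_id NULL
]

src_id = title_to_id["Wiki"]
BATCH = 100  # point 6: commit every X rows, not one giant transaction
batch = []
for label, dest_title, frag in parsed:
    dest_id = title_to_id.get(dest_title)  # dict lookup, no exception dance
    batch.append((src_id, label, dest_id, frag))
    if len(batch) >= BATCH:
        conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", batch)
        conn.commit()
        batch = []
if batch:  # flush whatever is left over
    conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", batch)
    conn.commit()

print(conn.execute("SELECT COUNT(*) FROM links").fetchone()[0])  # -> 3
```

The dict lookup replaces both the per-link SELECT and the try/except around one(), and executemany() plus periodic commit() keeps each transaction bounded.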