On Sep 22, 2008, at 11:20 PM, CodeIsMightier wrote:

>
>     for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
>         print 'LINK from:', repr(self.title), 'to', \
>             repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
>         try:
>             link_dest = session.query(Article).filter_by(title=link_dest_title).one()
>         except sqlalchemy.orm.exc.NoResultFound:
>             link_dest = None
>         print link_dest
>         session.add(Link(self, link_label, link_dest, dest_frag))
>
> Basically, this parses the links in a page, looks each one up in the
> DB to resolve the reference, and then inserts a Link into the DB.
> The problem is that the "articles" table is over 7 million rows and
> there are maybe 50 million links.


Commenting only on this snippet:

1. Throwing/catching exceptions in Python is time consuming, so I
would not rely upon one() to achieve this; say all() instead and look
to see if rows are present (see the first sketch after this list).
2. If the point of session.add(Link(..)) is that a new Link object
gets added, you *definitely* need to flush every 100 objects or so.
When flush() proceeds, the Session has to sort its entire collection
of dirty objects, so a flush takes progressively longer as the set of
"unflushed" objects grows - keep that set small (second sketch below).
3. The ORM's creation of full objects is time consuming, so I would
select individual columns instead of entities, i.e.
session.query(Article.id).filter(...).all() (also shown in the first
sketch).
4. The ORM overall is designed to simplify a complex set of operations
over a complex object graph, and is not optimized for "raw speed" as
much as plain SQL expressions are. Since this is a very simplistic
operation, you'll get much better performance using plain SQL
expressions (see the tutorial at
http://www.sqlalchemy.org/docs/05/sqlexpression.html). I wouldn't use
the ORM at all for this particular operation (third sketch below).
5. The whole pattern of "fetch a row, see it doesn't exist, then
insert" is itself slow. If you're using MySQL, consider the REPLACE
construct, which removes the need to check for the row first (fourth
sketch below).
6. Depending on the database in use, you probably want to commit your
transaction every X number of rows - an exceedingly large number of
changes in one transaction can also put stress on the operation (the
batching sketch below commits periodically too).
7. If the full database of articles is 7 million rows, you might even
try loading the full list of article IDs and titles into an in-memory
dictionary (only the integer id and string title, *not* full ORM
objects) and resolving links that way. It would use a few gigs of RAM
but would only take a few minutes to try, to see if it's feasible
(last sketch below). Things like memcached and DBM might be of use
here too.
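
Here's a rough sketch of points 1 and 3 combined - query just the id
column and check the result list, instead of catching NoResultFound
(Article and link_dest_title are the names from your snippet):

    # query only the integer primary key, not a full Article object
    rows = session.query(Article.id).filter_by(title=link_dest_title).all()
    if rows:
        link_dest_id = rows[0][0]   # first row, first (only) column
    else:
        link_dest_id = None         # unresolved link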
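
For points 2 and 6, a minimal batching pattern looks something like
this (the batch sizes here are guesses you'd want to tune):

    count = 0
    for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
        # ... resolve link_dest as above ...
        session.add(Link(self, link_label, link_dest, dest_frag))
        count += 1
        if count % 100 == 0:
            session.flush()     # keeps the set of pending objects small
        if count % 10000 == 0:
            session.commit()    # bounds the size of the transaction
    session.commit()            # flushes and commits whatever remains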
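
For point 4, assuming the links table can be reflected from the
database (the column names below are guesses based on your Link
constructor), a plain expression-language version would run roughly
like this:

    from sqlalchemy import create_engine, MetaData, Table

    engine = create_engine('mysql://user:pass@localhost/wiki')  # placeholder URL
    metadata = MetaData()
    # reflect the existing table rather than redefining it
    links_table = Table('links', metadata, autoload=True,
                        autoload_with=engine)

    conn = engine.connect()
    # executemany: one INSERT statement, many parameter sets
    conn.execute(links_table.insert(), [
        {'source_id': source_id, 'label': label,
         'dest_id': dest_id, 'dest_frag': frag}
        for (label, dest_id, frag) in pending_links
    ])

where pending_links is a hypothetical list of tuples you'd accumulate
while parsing, instead of calling session.add() per link.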
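
For point 5, there's no REPLACE construct built into SQLAlchemy, so
you'd issue it as a textual statement (table/column names assumed as
above; note REPLACE needs a PRIMARY KEY or UNIQUE index to detect the
existing row):

    from sqlalchemy import text

    replace_stmt = text(
        "REPLACE INTO links (source_id, label, dest_id, dest_frag) "
        "VALUES (:source_id, :label, :dest_id, :dest_frag)"
    )
    conn.execute(replace_stmt, source_id=source_id, label=label,
                 dest_id=dest_id, dest_frag=frag)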
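
Finally, for point 7, loading the title-to-id map up front is nearly a
one-liner, since query(Article.title, Article.id) yields (title, id)
tuples:

    # ~7 million (title, id) pairs loaded once; a few gigs of RAM
    title_to_id = dict(session.query(Article.title, Article.id))

    # resolving a link is then a pure in-memory lookup, no SQL per link
    link_dest_id = title_to_id.get(link_dest_title)   # None if unresolved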



