Hi:

I am working on an open source project to build a search engine /
data-mining framework of sorts for Wikipedia, and one of the first
things I need to do is parse the Wikipedia dumps into an SQL
database. I am using SQLAlchemy to do this, but it is very slow (at
the current rate, it would take 130 days!). I am sure I am doing
something wrong, since I am new at this, so I am wondering whether
any SQLAlchemy veterans can offer their insights.

The code can be found here:
http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py

The critical part of the code is this:

    for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
        print 'LINK from:', repr(self.title), 'to', \
            repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
        try:
            # Resolve the destination title to its Article row
            # (one SELECT per link).
            link_dest = session.query(Article).filter_by(
                title=link_dest_title).one()
        except sqlalchemy.orm.exc.NoResultFound:
            # Red link: no such article in the dump.
            link_dest = None
        print link_dest
        session.add(Link(self, link_label, link_dest, dest_frag))

Basically, this parses the links in a page, looks each destination up
in the DB to resolve the reference, and then inserts a Link into the
DB. The problem is that the "articles" table has over 7 million rows,
and there are maybe 50 million links in total.
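
One idea I have been toying with, but have not tried yet, is loading
the whole title-to-id mapping into memory once, so the inner loop
never has to query the database at all. A rough sketch, assuming
Article has an integer primary key column named "id" and that Link
could be changed to accept a raw destination id instead of an
Article object:

    # Untested sketch: build the title -> id map once up front.
    # ~7 million titles will eat a fair amount of RAM (a gigabyte
    # or so?), but that may be a fair trade for skipping 50 million
    # SELECTs.
    title_to_id = dict(session.query(Article.title, Article.id))

    for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
        # Plain dict lookup; .get() returns None for red links,
        # just like the NoResultFound branch above.
        dest_id = title_to_id.get(link_dest_title)
        session.add(Link(self, link_label, dest_id, dest_frag))

Does that sound sane, or is there a better way to do this within
SQLAlchemy itself?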

I have tried both SQLite and Postgres as the database. Postgres's
EXPLAIN ANALYZE claims that the SELECT above should take only around
25 ms! (Although even at 25 ms each, 50 million lookups would come
to roughly two weeks of query time by themselves.) I think I am
doing something wrong with SQLAlchemy; maybe I am creating too many
objects? Any help would be much appreciated.
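
The other thing I wonder about is the session itself: I add every
Link to one long-lived session, so presumably the identity map grows
to millions of objects. Would committing and clearing the session
every few thousand links help? Something like this (untested sketch;
BATCH is an arbitrary number I made up, and I believe expunge_all()
is called clear() on older SQLAlchemy versions):

    BATCH = 1000
    pending = 0
    for link_label, link_dest_title, dest_frag in self.parse_links(self.text):
        dest_id = title_to_id.get(link_dest_title)  # dict from the sketch above
        session.add(Link(self, link_label, dest_id, dest_frag))
        pending += 1
        if pending >= BATCH:
            # Flush the batch and drop the accumulated objects so
            # the session's identity map stays small.
            session.commit()
            session.expunge_all()
            pending = 0
    session.commit()  # pick up the final partial batch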
