Just a couple thoughts that might help you out:

1) I would profile the code.  It seems to me that running a regular
expression on an entire Wikipedia article would be a VERY expensive
operation (there is a quick profiling sketch further down).

2) Did the first pass succeed, and how long did it take?

3) Taking a quick look at
http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py, it seems
to me that the second pass through the data would create a second set
of Article objects that are never saved to the database.  Therefore,
the 'self' reference in:

    session.add(Link(self, link_label, link_dest, dest_frag))

would refer to an object that is never saved.  I guess this would not
matter, since the id field is correct (you set it explicitly), but it
seems to me that it might be better (and faster) to just read through
the articles table for pass two instead of re-parsing the XML,
something like:

    # delete the data from the previous run by rebuilding the tables
    redirects_table.drop(bind=engine)
    redirects_table.create(bind=engine)
    links_table.drop(bind=engine)
    links_table.create(bind=engine)

    # re-parse the article text that is already in the database
    # instead of reading the XML dump a second time
    for article in session.query(Article):
        article.parse_text(session)
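
One caveat with the above: session.query(Article) will try to load
every Article row into memory at once, which for something
Wikipedia-sized could be a problem.  Depending on your SQLAlchemy
version you can stream the rows in batches with yield_per, roughly:

    # stream articles in batches of 1000 (batch size is arbitrary)
    # rather than loading the whole table into memory at once
    for article in session.query(Article).yield_per(1000):
        article.parse_text(session)

yield_per has some caveats around eagerly loaded collections, so check
the docs for the version you are running.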

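Regarding 1), here is a minimal sketch of the kind of profiling I had
in mind.  parse_all() is just a placeholder for whatever your
top-level parsing entry point is actually called:

    import cProfile
    import pstats

    # profile one pass and report the 20 most expensive calls
    cProfile.run('parse_all()', 'miner.prof')
    pstats.Stats('miner.prof').sort_stats('cumulative').print_stats(20)

That should tell you fairly quickly whether the regular expressions
(or the database writes) are the actual bottleneck.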

It's pretty late, so I may have missed something.  Hope the above helps.

Shawn Church

I/S Consultant
shawn at SChurchComputers.com
