This is a general programming approach, not SQL-specific. For 7 million objects you have to try some "vertical" layer-splitting of the data ("vertical" is probably the wrong term). Imagine the objects as rectangles on a horizontal line, each containing the same layers. Right now you walk the rectangles as "for each object in X: walk its layers in Y"; you need to try walking "for each layer in Y: walk all objects in X". I.e. if you have an Object A that maps to tables tA and tB, doing a million objects the first way means a million rounds of (insert into tA, insert into tB, and maybe an update of a foreign key). Try to reach it the other way around: one lump insert of a million rows into tA, another lump insert of a million into tB, whatever - see the first sketch below.

If it's still slow - or hard to knit the links together - you could generate some DB-specific dump/replication format and import from there (second sketch below). I guess you can go for the latter approach even now, especially if your data does not change (or you pick the 99% constant part, import that one, then add the rest slowly and dynamically).
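Here is a rough sketch of the per-layer walk with SQLAlchemy. Everything in it is invented for illustration - the table layouts, the connection string, and a parse_all_pages() helper assumed to yield (title, outlinks) pairs - so substitute your real schema and parser:

    # sketch only -- schema, DSN and parse_all_pages() are placeholders
    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            Integer, String, Text, select)

    engine = create_engine("postgresql:///wikipedia")
    metadata = MetaData()
    articles = Table("articles", metadata,
                     Column("id", Integer, primary_key=True),
                     Column("title", String(255), unique=True))
    links = Table("links", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("src_id", Integer),
                  Column("dest_id", Integer),       # NULL if target missing
                  Column("label", Text))
    metadata.create_all(engine)

    # assumed parser output: [(title, [(label, dest_title), ...]), ...]
    pages = list(parse_all_pages())

    # layer 1: one executemany-style INSERT for all articles
    with engine.begin() as conn:
        conn.execute(articles.insert(),
                     [{"title": title} for title, _ in pages])

    # resolve titles once, in memory, instead of one SELECT per link
    with engine.connect() as conn:
        id_of = dict(conn.execute(select(articles.c.title, articles.c.id)))

    # layer 2: one executemany-style INSERT for all links, pre-resolved
    with engine.begin() as conn:
        conn.execute(links.insert(),
                     [{"src_id": id_of[title],
                       "dest_id": id_of.get(dest),  # None -> NULL
                       "label": label}
                      for title, outlinks in pages
                      for label, dest in outlinks])

In practice you would feed those row lists to conn.execute() in chunks of a few thousand dicts each, so you never hold 50 million of them in memory at once; the point is that each chunk is one executemany round-trip, not one INSERT per link.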
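And a sketch of the dump-and-import route for Postgres, via psycopg2's copy_from(). Same caveats: the file name and the pages list are placeholders, and COPY's text format needs backslash escaping for tabs, newlines and backslashes inside the values, which this skips:

    # sketch only -- dump one column per line, then bulk-load with COPY
    import psycopg2

    # real titles would need COPY's backslash escaping pass here
    with open("articles.tsv", "w", encoding="utf-8") as f:
        for title, _ in pages:          # pages as in the sketch above
            f.write(title + "\n")

    # load: a single COPY statement instead of millions of INSERTs
    conn = psycopg2.connect("dbname=wikipedia")
    with conn:
        with conn.cursor() as cur, open("articles.tsv", encoding="utf-8") as f:
            cur.copy_from(f, "articles", columns=("title",))
    conn.close()

COPY is usually orders of magnitude faster than row-at-a-time INSERTs, which is why the dump/import route pays off even with the extra escaping work.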
ciao
svilen
www.svilendobrev.com

On Tuesday 23 September 2008 06:20:06 CodeIsMightier wrote:
> Hi:
>
> I am working on an open source project to write a search engine /
> datamining framework of sorts for Wikipedia, and one of the first
> things I need to do is to parse the Wikipedia dumps into an SQL
> database. I am using sqlalchemy to do this but it is very slow
> (at the current rate, 130 days!!!). I am sure that I am doing
> something wrong since I am new at this, and am wondering whether
> any sqlalchemy veterans can offer their insights.
>
> The code can be found here:
> http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py
>
> The critical part of the code is this:
>
> for link_label, link_dest_title, dest_frag in \
>         self.parse_links(self.text):
>     print 'LINK from:', repr(self.title), 'to', \
>         repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
>     try:
>         link_dest = \
>             session.query(Article).filter_by(title=link_dest_title).one()
>     except sqlalchemy.orm.exc.NoResultFound:
>         link_dest = None
>     print link_dest
>     session.add(Link(self, link_label, link_dest, dest_frag))
>
> Basically what this does is parse the links in a page, look each one
> up in the DB to resolve the reference, and then insert a Link into
> the DB. The problem is that the "articles" table is over 7 million
> rows and there are maybe 50 million links.
>
> I have tried using both SQLite and Postgres as the database.
> Postgres EXPLAIN ANALYZE claims that the above statements should
> take only around 25 ms! I think I am doing something wrong with
> sqlalchemy, maybe I am creating too many objects? Any help would
> be very appreciated.