This is a general programming approach, not SQL-specific. For 7 million objects you have to try some "vertical" layer-splitting of the data ("vertical" is probably the wrong term). Imagine the objects as rectangles on a horizontal line, each containing the same layers. Right now you walk the rectangles as "for each object in X: walk its layers in Y"; you need to try walking "for each layer in Y: walk all objects in X". I.e. if you have an Object A that maps to tables tA and tB, doing a million objects the first way means a million rounds of (insert into tA, insert into tB, and maybe an update of a foreign key). Try to reach it the other way around: one lump insert of a million rows into tA, another lump insert of a million into tB, whatever - see the first sketch below.

If it's still slow - or hard to knit the links together - you could generate some DB-specific dump/replication format and import from there (second sketch below). I guess you can go for the latter approach even now, especially if your data does not change (or you pick the 99% constant part, import that one, then add the rest slowly and dynamically).
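Here is a rough sketch of the per-layer walk with SQLAlchemy. Everything in it is invented for illustration - the table layouts, the connection string, and a parse_all_pages() helper assumed to yield (title, outlinks) pairs - so substitute your real schema and parser:

    # sketch only -- schema, DSN and parse_all_pages() are placeholders
    from sqlalchemy import (create_engine, MetaData, Table, Column,
                            Integer, String, Text, select)

    engine = create_engine("postgresql:///wikipedia")
    metadata = MetaData()
    articles = Table("articles", metadata,
                     Column("id", Integer, primary_key=True),
                     Column("title", String(255), unique=True))
    links = Table("links", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("src_id", Integer),
                  Column("dest_id", Integer),       # NULL if target missing
                  Column("label", Text))
    metadata.create_all(engine)

    # assumed parser output: [(title, [(label, dest_title), ...]), ...]
    pages = list(parse_all_pages())

    # layer 1: one executemany-style INSERT for all articles
    with engine.begin() as conn:
        conn.execute(articles.insert(),
                     [{"title": title} for title, _ in pages])

    # resolve titles once, in memory, instead of one SELECT per link
    with engine.connect() as conn:
        id_of = dict(conn.execute(select(articles.c.title, articles.c.id)))

    # layer 2: one executemany-style INSERT for all links, pre-resolved
    with engine.begin() as conn:
        conn.execute(links.insert(),
                     [{"src_id": id_of[title],
                       "dest_id": id_of.get(dest),  # None -> NULL
                       "label": label}
                      for title, outlinks in pages
                      for label, dest in outlinks])

In practice you would feed those row lists to conn.execute() in chunks of a few thousand dicts each, so you never hold 50 million of them in memory at once; the point is that each chunk is one executemany round-trip, not one INSERT per link.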
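And a sketch of the dump-and-import route for Postgres, via psycopg2's copy_from(). Same caveats: the file name and the pages list are placeholders, and COPY's text format needs backslash escaping for tabs, newlines and backslashes inside the values, which this skips:

    # sketch only -- dump one column per line, then bulk-load with COPY
    import psycopg2

    # real titles would need COPY's backslash escaping pass here
    with open("articles.tsv", "w", encoding="utf-8") as f:
        for title, _ in pages:          # pages as in the sketch above
            f.write(title + "\n")

    # load: a single COPY statement instead of millions of INSERTs
    conn = psycopg2.connect("dbname=wikipedia")
    with conn:
        with conn.cursor() as cur, open("articles.tsv", encoding="utf-8") as f:
            cur.copy_from(f, "articles", columns=("title",))
    conn.close()

COPY is usually orders of magnitude faster than row-at-a-time INSERTs, which is why the dump/import route pays off even with the extra escaping work.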
ciao
svilen
www.svilendobrev.com

On Tuesday 23 September 2008 06:20:06 CodeIsMightier wrote:
> Hi:
>
> I am working on an open source project to write a search engine /
> datamining framework of sorts for Wikipedia, and one of the first
> things I need to do is to parse the Wikipedia dumps into an SQL
> database. I am using sqlalchemy to do this but it is very slow
> (at the current rate, 130 days!!!). I am sure that I am doing
> something wrong since I am new at this, and am wondering whether
> any sqlalchemy veterans can offer their insights.
>
> The code can be found here:
> http://wikiminer.googlecode.com/svn/trunk/wikipedia_miner.py
>
> The critical part of the code is this:
>
> for link_label, link_dest_title, dest_frag in \
>         self.parse_links(self.text):
>     print 'LINK from:', repr(self.title), 'to', \
>         repr(link_dest_title + '#' + dest_frag), 'label', repr(link_label)
>     try:
>         link_dest = \
>             session.query(Article).filter_by(title=link_dest_title).one()
>     except sqlalchemy.orm.exc.NoResultFound:
>         link_dest = None
>     print link_dest
>     session.add(Link(self, link_label, link_dest, dest_frag))
>
> Basically what this does is parse the links in a page, look each one
> up in the DB to resolve the reference, and then insert a Link into
> the DB. The problem is that the "articles" table is over 7 million
> rows and there are maybe 50 million links.
>
> I have tried using both SQLite and Postgres as the database.
> Postgres EXPLAIN ANALYZE claims that the above statements should
> take only around 25 ms! I think I am doing something wrong with
> sqlalchemy, maybe I am creating too many objects? Any help would
> be very appreciated.