On 14 October 2010 09:37, Alex Brollo <alex.bro...@gmail.com> wrote:
> 2010/10/13 Paul Houle <p...@ontology2.com>
>
>> Don't be intimidated by working with the data dumps. If you've got
>> an XML API that does streaming processing (I used .NET's XmlReader) and
>> use the old unix trick of piping the output of bunzip2 into your
>> program, it's really pretty easy.
>
> When I worked on it.source (a small dump! something like 300 MB
> unzipped), I used a simple do-it-yourself Python string-search routine
> and found it really faster than Python's XML routines. I presume my
> scripts are too rough to deserve sharing, but I encourage programmers to
> write a "simple dump reader" that exploits the speed of string search.
> My personal trick was to build an "index", i.e. a list of pointers to
> articles and article names in the XML file, so that recovering their
> content was simple and fast. I used it mainly because I didn't
> understand the API at all. ;-)
>
> Alex
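The pointer-index trick Alex describes can be sketched in a few lines of Python. This is only an illustration, not his actual script: the function and variable names are mine, and it assumes titles sit on a single line (as they do in the standard pages-articles dumps). It streams the bzip2-compressed dump once, using plain string search rather than an XML parser, and records the decompressed byte offset of each <page> keyed by its title.

```python
import bz2

def build_index(dump_path):
    """Map page titles to the byte offset (in the decompressed
    stream) where their <page> element starts. Illustrative sketch;
    assumes <title> fits on one line, as in the standard dumps."""
    index = {}
    offset = 0      # bytes of decompressed stream consumed so far
    pending = None  # offset of the last <page> still awaiting a title
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                pending = offset + line.index("<page>")
            elif pending is not None and "<title>" in line:
                start = line.index("<title>") + len("<title>")
                end = line.index("</title>", start)
                index[line[start:end]] = pending
                pending = None
            offset += len(line.encode("utf-8"))
    return index
```

The recorded offsets refer to the *decompressed* stream, so to fetch an article later you either keep an unzipped copy of the dump and seek() directly, or seek() on a bz2 file object (which works, but re-decompresses from the start and is slow).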
Hi Alex. I have been doing something similar in Perl for a few years for
the English Wiktionary. I've never been sure of the best way to store all
the index files I create, especially in code to share with other people,
which is something I would like to happen. It would be pretty cool if
you, or anyone else for that matter, would like to collaborate. You'll
find my stuff on the Toolserver:
https://fisheye.toolserver.org/browse/enwikt

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com
http://linguaphile.sf.net

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l