On 14 October 2010 09:37, Alex Brollo <alex.bro...@gmail.com> wrote:
> 2010/10/13 Paul Houle <p...@ontology2.com>
>
>>
>>     Don't be intimidated by working with the data dumps. If you've got
>> an XML API that does streaming processing (I used .NET's XmlReader) and
>> you use the old Unix trick of piping the output of bunzip2 into your
>> program, it's really pretty easy.
>>
>
> When I worked on the it.source dump (a small one, something like 300 MB
> unzipped), I used a simple do-it-yourself Python string-search routine, and
> I found it much faster than Python's XML routines. I presume my scripts are
> really too rough to deserve sharing, but I encourage programmers to write a
> "simple dump reader" that exploits the speed of string search. My personal
> trick was to build an "index", i.e. a list of pointers to the articles, and
> their names, within the xml file, so that it was simple and fast to recover
> their content. I used it mainly because I didn't understand the API at all. ;-)
>
> Alex


Hi Alex. I have been doing something similar in Perl for a few years for
the English Wiktionary. I've never been sure of the best way to store all
the index files I create, especially in code I'd like to share with other
people, which is something I would like to happen. If you, or anyone else
for that matter, would like to collaborate, that would be pretty cool.
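
In case it helps anyone following along, here's a rough Python sketch of
the kind of indexer Alex is describing: read the decompressed dump from
stdin (e.g. bunzip2 -c pages-articles.xml.bz2 | python build_index.py >
titles.idx) and record the byte offset of each <page> using plain string
matching rather than an XML parser. The file names are only examples, and
it assumes the one-tag-per-line layout of the pages-articles dumps.

import sys

def build_index(stream, out):
    """Scan the decompressed XML stream, writing 'byte_offset<TAB>title' per page."""
    offset = 0        # byte offset of the current line within the uncompressed dump
    page_start = 0    # offset where the most recent <page> element began
    for line in stream:              # binary lines, so len(line) is a true byte count
        stripped = line.strip()
        if stripped == b'<page>':
            page_start = offset
        elif stripped.startswith(b'<title>') and stripped.endswith(b'</title>'):
            # plain slicing instead of an XML parser -- relies on the dump
            # keeping one tag per line, which the pages-articles dumps do
            title = stripped[len(b'<title>'):-len(b'</title>')]
            # titles are stored exactly as they appear in the dump (XML-escaped)
            out.write(b'%d\t%s\n' % (page_start, title))
        offset += len(line)

if __name__ == '__main__':
    # Python 3: use the binary buffers so offsets are real byte counts
    build_index(sys.stdin.buffer, sys.stdout.buffer)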

You'll find my stuff on the Toolserver:
https://fisheye.toolserver.org/browse/enwikt
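
As for storing the index itself, the least clever thing I can think of
that would still be easy for other people to reuse is a plain
'offset<TAB>title' text file kept next to the uncompressed dump; looking
up a page is then just a seek and a read. A rough sketch along those lines
(again, the file names are only examples, and it assumes you keep the dump
uncompressed on disk, since you can't seek into a plain .bz2 stream):

def load_index(path):
    """Read the 'offset<TAB>title' index file into a dict of title -> byte offset."""
    index = {}
    with open(path, 'rb') as f:
        for line in f:
            offset, title = line.rstrip(b'\n').split(b'\t', 1)
            index[title] = int(offset)
    return index

def fetch_page(dump_path, index, title):
    """Seek straight to a page's <page> element in the uncompressed dump
    and return its raw XML, including the <revision> and wikitext."""
    with open(dump_path, 'rb') as dump:
        dump.seek(index[title])
        chunk = []
        for line in dump:
            chunk.append(line)
            if line.strip() == b'</page>':
                break
        return b''.join(chunk)

# Example usage (hypothetical file names):
# index = load_index('titles.idx')
# xml = fetch_page('enwiki-pages-articles.xml', index, b'dictionary')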

Andrew Dunbar (hippietrail)


-- 
http://wiktionarydev.leuksman.com http://linguaphile.sf.net
