Hi all,

Some time ago, I posted a message asking for volunteers to create a Wikipedia CD/DVD.
Since then, I have been working on this project and have made some advances, which will be published as soon as they work as expected.

Now I need advice about possible strategies for creating a fast and responsive word index for all Wikipedia articles, similar to the capabilities demonstrated by the Google search engine, with suggested search terms and similar words.

Note that I am not using any database engine in this project to index the article titles. Because of memory constraints and for performance reasons, these are the steps I followed:

1) The Wikipedia XML database is divided into multiple small UTF8 text files (each approx. 1 MB), compressed in .gz format (reduced to 250-350 KB). I have files numbered from 00001 to 06455 for the Spanish Wikipedia. The English Wikipedia runs from 00001 to 28750.

NOTE: Using such small database files allows users to read any linked article quickly, because the program finds, decompresses and processes a small file. This is fast, even on old computers.

2) Each database part is indexed for article titles and words.

3) These multiple index files are merged into one big UTF8 index text file arranged in alphabetical order.

4) The big UTF8 index text file is split into 28 small UTF8 index text files. That is, a different file for each letter:

1 file for decimal ASCII 33 to 64: ! to @
26 files for decimal ASCII 65 to 90: A to Z
1 file for decimal ASCII 91 and more...

The largest UTF8 index text file is the one for the letter C.

5) When users click an article link, the program checks the first letter of the clicked link and searches for the article name in the corresponding index. That is, a linked article that starts with G is searched only in the UTF8 article index "G".

This works fine 99.9% of the time; the failures come from some errors in the names of linked articles.

Now I am looking for advice on creating an index structure for searching specific words inside the articles' text. I have been unable to implement a fast search algorithm, using multiple words, similar to Wikipedia's own search engine.
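
To make steps 4 and 5 concrete, here is a rough sketch in Python (not the project's actual code). The bucket names "symbols" and "other", the (title, part number) layout of an index entry, the sample entries, and the use of binary search inside a bucket are all assumptions made for illustration only.

```python
import bisect


def bucket_for(title):
    """Pick the index bucket for a title from its first character.

    Mirrors the 28-way split described above: one bucket for decimal
    codes up to 64, one per letter A to Z (65 to 90), and one for
    everything from 91 upward. The names "symbols" and "other" are
    hypothetical.
    """
    code = ord(title[0])
    if code <= 64:
        return "symbols"
    if code <= 90:
        return chr(code)
    return "other"


def build_buckets(entries):
    """Group (title, part_number) pairs into sorted per-bucket lists."""
    buckets = {}
    for title, part in entries:
        buckets.setdefault(bucket_for(title), []).append((title, part))
    for items in buckets.values():
        items.sort()
    return buckets


def find_part(buckets, title):
    """Search only the bucket that can contain the title.

    Uses binary search (an assumption of this sketch) and returns the
    database part number, or None when the title is not indexed.
    """
    items = buckets.get(bucket_for(title), [])
    i = bisect.bisect_left(items, (title,))
    if i < len(items) and items[i][0] == title:
        return items[i][1]
    return None


# Made-up sample entries: (article title, database part number).
SAMPLE = [
    ("Galileo Galilei", "00412"),
    ("Cervantes", "00107"),
    ("Zaragoza", "06001"),
    ("42 (number)", "00003"),
    ("Ábaco", "00001"),
]
buckets = build_buckets(SAMPLE)
```

With this layout, find_part(buckets, "Galileo Galilei") only looks at the "G" bucket, and a title that is missing from the index simply returns None. In the real project each bucket would be one of the 28 index files on disk instead of an in-memory list.
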
Any ideas or advice are welcome.

Thanks in advance!

Alejandro
_______________________________________________
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution