Re: Words Indexing strategies
Some time ago, I posted a message asking for volunteers to create a Wikipedia CD/DVD. Since then, I have been working on this project and have made some progress, which will be published as soon as it works as expected. These are the steps that I am following to process the XML databases:

The Wikipedia XML databases are huge, and we get the best results when the XML databases are downloaded directly from: http://download.wikipedia.org/

The download link for the Wikipedia English XML database from 2010 February 3 is:
http://download.wikimedia.org/enwiki/20100130/enwiki-20100130-pages-articles.xml.bz2

The download link for the Wikipedia Spanish XML database from 2010 February 21 is:
http://download.wikimedia.org/eswiki/20100221/eswiki-20100221-pages-meta-current.xml.bz2

After downloading the compressed XML database, you should put the database inside a folder (not in the disk root) and split the file into small bz2 files using bzip2recover:
http://www.bzip.org/downloads.html
http://www.bzip.org/1.0.5/bzip2recover-105-x86-win32.exe

It is easier to deal with many small compressed files than with one humongous text file of more than 25 GB (English XML database) or 5.3 GB (Spanish XML database). After using bzip2recover to split the English XML database, I get more than 28,000 small (~250 KB) bz2 files, or 6,800 small bz2 files for the Spanish XML database. Each of these files holds (more or less) a 1 MB segment of the database.

Notice that I chose a different file for the Spanish XML database than for the English one. That is because Wikipedia has been unable to solve a problem with their backup of the Spanish XML database:
https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

Alejandro

___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
Re: Words Indexing strategies
Hi Bernard,

Bernard Devlin-2 wrote:
[snip] Am I right that given these search terms: baboon OR monkey AND fruit and index file b.tgz contains a line like this: baboon: 1,5,9 index file m.tgz contains a line like this: monkey: 2,7,17 index file f.tgz contains a line like this: fruit: 3,7,23 you would want the result of your search to be: 7 i.e. the number of the article that matches the boolean search? Unless I've misunderstood, what you want to do is combine indexes in order to satisfy boolean combinations of search terms.

Yes, this is correct. With larger sets of data, it should be easier to convert these results into arrays and merge them, following the explanations provided by Ken Ray on his website:
http://www.sonsothunder.com/devres/revolution/tips/arry002.htm

Now, you have made a valid point with the following commentary:

Bernard Devlin-2 wrote:
However, it looks to me like the existing indexes don't contain enough information for you to calculate frequency of occurrence (a measure of relevance). And depending on how these pre-existing indexes have been constructed they may not have any stemming information in them. You might be able to build some kind of rough stemming algorithm in Rev (by doing rough pluralization like 'baboon*', but as Richard pointed out more complex plurals like 'children' will be where the work comes). Are you looking for an approximate solution? Or do you need greater flexibility of scope and relevance scores, etc.?

Starting this project, an approximate solution would be fine; eventually I can keep working to refine this search algorithm. After taking a close look at the 580,000-article index (and the 450,000-redirection index), I understand that employing a stemming algorithm effectively could save many megabytes. You are right about the lack of information in the current simple index format. Relevance should be a function of the number of times a word appears in an article.
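The merge Bernard describes can be sketched directly with set operations. This is an illustrative Python sketch (not Rev); the line format and the (baboon OR monkey) AND fruit grouping follow the example quoted above:

```python
def postings(line):
    """Parse a line like 'baboon: 1,5,9' into a set of article numbers."""
    word, _, ids = line.partition(":")
    return set(int(n) for n in ids.split(","))

baboon = postings("baboon: 1,5,9")
monkey = postings("monkey: 2,7,17")
fruit  = postings("fruit: 3,7,23")

# (baboon OR monkey) AND fruit -- union, then intersection
result = (baboon | monkey) & fruit
print(sorted(result))  # [7]
```

The same union/intersection idea carries over to Rev arrays keyed by article number, which is essentially what the merge technique on Ken Ray's page does.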
An index that could include this information would be:

monkey:3827#12|15,1#4|3,2131#18|3,34#3|2,4567#2|2,3456#22|1

In this new example: in compressed datafile number 3827, article #12, there are 15 instances of the word Monkey... Surely there is some better notation to handle this data; feel free to send your ideas and comments. Again, converting this data into an array makes it easier to work with.

Richard Gaskin wrote:
Once again, MetaCard to the rescue! :) Raney included this little gem in MC's Examples stack, and using repeat for each and arrays it's blazing fast, able to make a frequency table for even large files in almost no time at all:

Yes, this works great! Many thanks for digging up this handler. :-)

Have a nice weekend!

Alejandro
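As a rough illustration of how the proposed notation could be consumed, here is a Python sketch (not Rev; the parse_entry name and the sort-by-frequency ranking are my own assumptions about how the data would be used):

```python
def parse_entry(line):
    """Parse 'word:datafile#article|count,...' into ranked hits."""
    word, _, rest = line.partition(":")
    hits = []
    for item in rest.split(","):
        loc, _, count = item.partition("|")
        datafile, _, article = loc.partition("#")
        hits.append((int(datafile), int(article), int(count)))
    # most occurrences first = a crude relevance ranking
    hits.sort(key=lambda h: -h[2])
    return word, hits

word, hits = parse_entry("monkey:3827#12|15,1#4|3,2131#18|3,34#3|2,4567#2|2,3456#22|1")
print(hits[0])  # (3827, 12, 15): datafile 3827, article 12, 15 hits
```

A Rev version would do the same split into a nested array keyed by datafile and article number.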
Re: Words Indexing strategies
On Fri, Feb 12, 2010 at 3:24 AM, Alejandro Tejada capellan2...@gmail.com wrote:
I have a dll named dbsqlite.dll (452 K) in my Rev Studio installation. If an experienced database developer could lend a hand, I would be really grateful.

Hi Alejandro,

OK, since you don't know anything about databases, I'm assuming that it is also going to be quite a lot of work to deal with compiling the sqlite FT plugin for the different platforms too. (I've no idea how hard that would be or what problems you might meet with, but in general these things are always harder than they are supposed to be.)

So, let's consider a Rev-only solution, and if that looks like a hopeless case then it will make the work required to deal with databases and compilation more worthwhile.

As I'm slow on the uptake, I am still not entirely sure what it is you are trying to do. Am I right that, given these search terms:

baboon OR monkey AND fruit

and index file b.tgz contains a line like this: baboon: 1,5,9
index file m.tgz contains a line like this: monkey: 2,7,17
index file f.tgz contains a line like this: fruit: 3,7,23

you would want the result of your search to be 7, i.e. the number of the article that matches the boolean search? Unless I've misunderstood, what you want to do is combine indexes in order to satisfy boolean combinations of search terms.

However, it looks to me like the existing indexes don't contain enough information for you to calculate frequency of occurrence (a measure of relevance). And depending on how these pre-existing indexes have been constructed, they may not have any stemming information in them. You might be able to build some kind of rough stemming algorithm in Rev (by doing rough pluralization like 'baboon*'), but as Richard pointed out, more complex plurals like 'children' will be where the work comes in.

Are you looking for an approximate solution? Or do you need greater flexibility of scope and relevance scores, etc.?
Bernard
Re: Words Indexing strategies
Bernard Devlin wrote:
However, it looks to me like the existing indexes don't contain enough information for you to calculate frequency of occurrence (a measure of relevance).

Once again, MetaCard to the rescue! :) Raney included this little gem in MC's Examples stack, and using "repeat for each" and arrays it's blazing fast, able to make a frequency table for even large files in almost no time at all:

  on mouseUp
    put empty into field "result"
    answer file "Select a text file for input:"
    if it is empty then exit mouseUp
    # let the user know we're working on it
    set the cursor to watch
    put it into inputFile
    open file inputFile for read
    read from file inputFile until eof
    put it into fileContent
    close file inputFile
    # wordCount is an associative array; its indexes are words,
    # with the contents of each element being the number of times
    # that word appears
    repeat for each word w in fileContent
      add 1 to wordCount[w]
    end repeat
    # copy all the indexes in the wordCount associative array
    put keys(wordCount) into keyWords
    # sort the indexes -- keyWords contains one line per array element
    sort keyWords
    repeat for each line l in keyWords
      put l & tab & wordCount[l] & return after displayResult
    end repeat
    put displayResult into field "result"
  end mouseUp

--
Richard Gaskin
Fourth World
Rev training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
revJournal blog: http://revjournal.com/blog.irv
Re: Words Indexing strategies
On Wed, Feb 10, 2010 at 10:30 PM, Alejandro Tejada capellan2...@gmail.com wrote:
Yes, each one of these 28 text files will be compressed in gz format. When users look for a word, or many words, only these file(s) are decompressed and searched.

Like Brian, I was going to suggest existing search technologies like Lucene. Why re-invent the wheel? I understand you not wanting to ship Java and get the user to install it. However, there may be other pre-existing solutions to your problem.

Is it imperative to your solution that these 28 compressed text files are part of the solution? I mean, are you trying to maintain the structure of the solution such that someone who comes along and looks at it can see where Rev fits in? Or can your indexes be stored in a database?

The reason I ask is that: a) Valentina already has two forms of text search (one form is very fast but only looks up single words; the other can search an entire database using regex, but is slower and probably not fast enough for your requirements). Unless you already have Valentina for each platform, this solution will involve you in the cost of buying licenses. b) sqlite already has a full-text search facility (although I think you may have to compile it as a sqlite plug-in and distribute it with your application). It does things like word stemming, stop lists, frequencies, etc. You would have to distribute this sqlite add-on with your solution.

http://ft3.sourceforge.net/v0.3/userguide.html
http://michaeltrier.com/2008/7/13/full-text-search-on-sqlite

I suspect that the sqlite option may be the best. I can't testify to the speed of its searching, but I seem to remember that the indexing is asynchronous (not that that would be a consideration in your application). Using a pre-existing solution (even one that requires some modification on your part) has got to be easier than building your own full-text search system.
The above suggestions could still provide you with a cross-platform solution. If you still want a Rev-only solution let us know. Maybe someone else will chip in with suggestions :-) Hope that helps.

Bernard
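For a feel of the kind of query the sqlite full-text facility supports, here is a minimal sketch via Python's sqlite3 module. It assumes the bundled SQLite was built with the FTS5 extension (true of most modern builds; the 2010-era add-on discussed above was FTS3, with similar MATCH syntax), and the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A virtual FTS table: every column is full-text indexed
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Monkey", "The monkey ate some fruit."),
     ("Baboon", "Baboons are large monkeys.")],
)
# MATCH supports AND / OR / NOT boolean operators
rows = conn.execute(
    "SELECT title FROM articles WHERE articles MATCH 'monkey AND fruit'"
).fetchall()
print(rows)  # [('Monkey',)]
```

Note that without a stemming tokenizer, "monkeys" in the second article does not match the query term "monkey", which is exactly the plural problem raised elsewhere in this thread.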
Re: Words Indexing strategies
Hi Bernard,

on Thu, 11 Feb 2010 09:13:46 + Bernard Devlin wrote:
Like Brian, I was going to suggest existing search technologies like Lucene. Why re-invent the wheel? I understand you not wanting to ship Java and get the user to install it. However there may be other pre-existing solutions to your problem.

Hopefully!

Is it imperative to your solution that these 28 compressed text files are part of the solution? I mean, are you trying to maintain the structure of the solution such that someone who comes along and looks at your solution can see where Rev fits in. Or can your indexes be stored in a database?

Sure, these indexes could be stored in a database.

The reason I say this is because a) Valentina already has two forms of text searches (one form is very fast but only looks up single words, the other form can search an entire database using regex but is slower, and probably not fast enough for your requirements). Unless you already have Valentina for each platform, this solution will involve you in the cost of buying licenses. The other thing to consider is that sqlite already has a full-text search facility (although I think you may have to compile it as a sqlite plug-in and distribute it with your application). It does things like word-stemming, stop lists, frequencies, etc. You would have to distribute this sqlite add-on with your solution.

I have a dll named dbsqlite.dll (452 K) in my Rev Studio installation. If an experienced database developer could lend a hand, I would be really grateful.

If you still want a Rev-only solution let us know. Maybe someone else will chip in with suggestions :-)

Yes, a Rev-only solution would be ideal. I remember that Rob Cozens wrote a Rev-only database, but my inexperience with databases does not help me fully appreciate this work.

Thanks a lot for your advice!
:-)

Alejandro
Re: Words Indexing strategies
Hi Bernard,

on Wed, 10 Feb 2010 07:20:36 + Bernard Devlin wrote:
Can I just clarify your problem? You want to be able to search for phrases (partial sentences, possibly with boolean logic) inside the text stored in the xml nodes of the article, once the article is found in the index?

No, it's not a search inside the displayed article. It's a global search, within a general index created using all words from all articles of Wikipedia. (I do not believe that it's necessary to load this full index into memory; instead, just open specific parts of this index when users start searching.)

For this reason, I am looking for advice on creating an index structure that allows me to implement a fast search algorithm, using multiple words (and boolean logic, if possible), similar to Wikipedia's own search engine or (better yet) just like Google. :-)

Many thanks for your interest in this question!

Alejandro
Re: Words Indexing strategies
On Wed, Feb 10, 2010 at 2:56 PM, Alejandro Tejada capellan2...@gmail.com wrote:
No, it's not a search inside the displayed article. It's a global search, within a general index created using all words from all articles of Wikipedia. (I do not believe that it's necessary to load this full index in memory, instead just open specific parts of this index when users start searching)

OK, so that's why you mention the different files for each letter of the alphabet. I'm still a bit confused. Normally an index would indicate a location for an indexed term. That's what I assume your general index files are doing. What are the key terms like in this index, and what do they point to? Can you give us some examples?

For this reason, i am looking for advice to create an index structure that allows to implement a fast search algorithm, using multiple words (and boolean logic, if possible), similar to Wikipedia's own search engine or (better yet) just like google. :-)

To my confused and befuddled mind, it sounds like you are wanting to create an index of the index. That can't be right :-)

I'm no expert in search algorithms. I have been hoping someone else would jump in who has done this kind of thing before. Are you wanting a pure, Rev-only solution, i.e. are you doing this to demonstrate what can be done using Rev alone?

Bernard
Re: Words Indexing strategies
Alejandro,

The first step for this would likely include creating an inverted index. This means you store something like:

monkey:1,34,3827,21314

where the word being indexed is monkey and the numbers that follow are article IDs. Using this information it is pretty trivial to implement AND / OR: just merge the article IDs as needed. You can use a mix of this inverted index and your regular index in combination to do other types of queries. You can even store word frequency information to find similar articles.

With that said, I think you will have a very hard time crafting a competitive algorithm for this in Rev. A lot of the engines out there are very mature, and there are many open-source ones that you might consider bundling with your project instead of writing it yourself.

Some things to watch out for:
1) File size. Indexing every word takes a lot of space, especially if your indices are in plain text and not a compact binary format.
2) Common words. You might consider a stop-word list, or a threshold. For example, if a word is in 20% of all articles, don't index it. Or if it's in your stop-word list.
3) Root words / plurals. Can you detect that monkeys is the plural of monkey (or more complex cases)?

An example of an open-source engine that is very mature is Lucene, which can be run from the command line in Java.

Hi Bernard, on Wed, 10 Feb 2010 07:20:36 + Bernard Devlin wrote: Can I just clarify your problem? You want to be able to search for phrases (partial sentences, possibly with boolean logic) inside the text stored in the xml nodes of the article, once the article is found in the index? No, it's not a search inside the displayed article. It's a global search, within a general index created using all words from all articles of Wikipedia.
(I do not believe that it's necessary to load this full index in memory, instead just open specific parts of this index when users start searching) For this reason, i am looking for advice to create an index structure that allows to implement a fast search algorithm, using multiple words (and boolean logic, if possible), similar to Wikipedia's own search engine or (better yet) just like google. :-) Many thanks for your interest in this question! Alejandro
Re: Words Indexing strategies
The ambitious Alejandro Tejada wrote:
It's a global search, within a general index created using all words from all articles of Wikipedia. (I do not believe that it's necessary to load this full index in memory, instead just open specific parts of this index when users start searching) For this reason, i am looking for advice to create an index structure that allows to implement a fast search algorithm, using multiple words (and boolean logic, if possible), similar to Wikipedia's own search engine or (better yet) just like google. :-)

A good place to start on that is the seminal paper describing the initial Google implementation, written by the founders:

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
http://infolab.stanford.edu/~backrub/google.html

But be warned: indexing is a deep topic, and may become a consuming passion. Roulette is rumored to have been invented by a monk who came to believe he could find a way to predict its outcomes, and eventually went mad trying. Indexing is a bit like that. :)

A couple of the longer-term projects I work on need to incorporate good indexing of large corpuses, and my own code to that end has advanced only in small baby steps as I learn more about it. So while I have little in the way of applied code to share at the moment, I can offer a few theoretical pointers:

From what I'm reading, my first advice would be to not bother unless you absolutely need to. If there's any way you can use an existing index you'll be a happier man to do so.

But if you have to make your own index, you may find it helpful (or maddening) to consider the challenges involved with the various tenses and inflections of words, and how to determine the root word in its native form (the lemma) for your lookups. Currently I'm looking into the various lemmatization schemes available, since it can help tremendously to both keep the index small enough to be practical while returning more relevant search results.
Lemmatization attempts this through linguistic rules; stemming could be said to be a form of cheating, using simpler algorithms less dependent on the nuances of a given language to attempt the same result. The differences are explained well here:

Stemming and lemmatization
http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

Links to specific stemming algorithms are at the bottom of this article:
http://www.comp.lancs.ac.uk/computing/research/stemming/general/

Without lemmatization or stemming, searches for "children" will fail to find "child", which could well be relevant to the searcher. And some stemming methods won't be able to transform "children" to "child", since it's an uncommon transformation linguistically, much less so than more common plural forms like just adding "s" or "es" to the end.

There's a wealth of info available searching the web for "indexing algorithms", "stemming algorithms", etc. Doing it well may take a lifetime; fortunately it seems there are some cheating methods which will do most of the job well enough in less time.

All that said, I have to wonder: if Wikipedia's content is available, isn't their index also available? Porting it from MySQL to SQLite seems a far less daunting task than writing an index from scratch.

--
Richard Gaskin
Fourth World
Rev training and consulting: http://www.fourthworld.com
Webzine for Rev developers: http://www.revjournal.com
revJournal blog: http://revjournal.com/blog.irv
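The "cheating methods" Richard mentions can be surprisingly small. Below is a deliberately crude suffix-stripping sketch in Python, with an exception table for irregular plurals like children -> child; the rules and the table are toy assumptions, nowhere near a real stemmer such as Porter's:

```python
# Tiny exception table for irregular plurals (illustrative only)
IRREGULAR = {"children": "child", "mice": "mouse", "feet": "foot"}

def crude_stem(word):
    """Strip common plural suffixes; handle a few irregulars."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:   # cities -> city
        return w[:-3] + "y"
    if w.endswith("es") and len(w) > 3:    # boxes -> box
        return w[:-2]
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                      # monkeys -> monkey
    return w

print(crude_stem("monkeys"), crude_stem("children"))  # monkey child
```

Stemming every word before it enters the index, and stemming the user's query the same way, is what makes "monkeys" and "monkey" land in the same posting list.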
Re: Words Indexing strategies
Many thanks for replying to this question. :-)

on Wed, 10 Feb 2010 15:30:23 + Bernard Devlin wrote:
OK, so that's why you mention the different files for each letter of the alphabet.

Yes, each one of these 28 text files will be compressed in gz format. When users look for a word, or many words, only these file(s) are decompressed and searched.

I'm still a bit confused. Normally an index would indicate a location for an indexed term. That's what I assume your general index files are doing. What are the key terms like in this index, and what do they point to? Can you give us some examples?

Sure, look at the example that Brian Yennie wrote: monkey:1,34,3827,21314 The numbers (after the word) are the names of compressed gz files. These files are located automatically or (if not found) manually, when the program starts.

To my confused and befuddled mind, it sounds like you are wanting to create an index of the index. That can't be right :-)

No, this program uses only three kinds of indexes: 1) Articles 2) Article name redirections 3) Words

I'm no expert in search algorithms. I have been hoping someone else would jump in who has done this kind of thing before. Are you wanting a pure, rev-only solution i.e. are you doing this to demonstrate what can be done using Rev alone?

Ideally, this should be a Rev-only solution, for cross-platform porting.

on Wed, 10 Feb 2010 11:17:08 -0500 Brian Yennie wrote:
The first step for this would likely include creating an inverted index. This means you store something like: monkey:1,34,3827,2131 Where the word being indexed is monkey and the numbers that follow are article IDs. Using this information it is pretty trivial to implement AND / OR. Just merge the article IDs as needed. You can use a mix of this inverted index and your regular index in combination to do other types of queries. You can even store word frequency information to find similar articles.
Yes, this is correct and should work fine, but how could I write in the word index a range of articles where a word appears consecutively? baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

With that said, I think you will have a very hard time crafting a competitive algorithm for this in Rev. A lot of the engines out there are very mature and there are many open source ones that you might consider bundling with your project instead of writing it yourself.

Actually, I am looking for good performance and cross-platform portability, so Rev is my first choice.

Some things to watch out for: 1) File size. Indexing every word takes a lot of space, especially if your indices are in plain text and not a compact binary format.

How could I convert this index format into a compact binary format? baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578 monkey:1,34,3827,2131,3456,4567,5678,5789,6123,6234,6456

2) Common words. You might consider a stop word list, or a threshold. For example, if a word is in 20% of all articles, don't index it. Or if it's in your stop word list.

Previously I had believed that stop words should appear in all articles.

3) Root words / plurals. Can you detect that monkeys is the plural of monkey (or more complex cases)?

Richard wrote about a similar concern in his answer. I suppose that this feature is useful to recommend similar terms when users start a new search.

An example of an open source engine that is very mature is Lucene, which can be run from the command-line in Java.

How could I run Java applications from Runrev, without asking users to install Java first?
on Wed, 10 Feb 2010 08:26:30 -0800 Richard Gaskin wrote:
The ambitious Alejandro Tejada wrote:

This is only a really modest search. :-D

A good place to start on that is the seminal paper describing the initial Google implementation, written by the founders: The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page, http://infolab.stanford.edu/~backrub/google.html But be warned: indexing is a deep topic, and may become a consuming passion. Roulette is rumored to have been invented by a monk who came to believe he could find a way to predict its outcomes, and eventually went mad trying. Indexing is a bit like that. :)

Many thanks for the pointers in this direction.

A couple of the longer-term projects I work on need to incorporate good indexing of large corpuses, and my own code to that end has advanced only in small baby steps as I learn more about it. [snip of really useful information]

All that said, I have to wonder: if Wikipedia's content is available, isn't their index also available?

An article index is available, but a word index is not.

Porting it from MySQL to SQLite seems a far less daunting task than writing an index from scratch.

SQLite is way over my head at the moment. :-)

Thanks again for answering this request!

Alejandro
Re: Words Indexing strategies
Yes, this is correct and should work fine, but how could i write in the word index a range of article where a word appears consecutively: baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

If this were your format, you could compact it to something like:

baboon:1934,2345-2352,2567,3578

How could i convert this index format in a compact binary format? baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578 monkey:1,34,3827,2131,3456,4567,5678,5789,6123,6234,6456

Well, there are a lot of possibilities that are probably way beyond the scope of this discussion; for starters, you could convert each number from text to binary. You could also go for a B-tree structure, but that is going to be awfully difficult in Rev.

Previously i believed that stop words should appear in all articles.

I would go with a threshold for sure. Think about what it means for index size if a word is in 50% of all articles or more. And why would you want to search for that word anyway?

Richard wrote about a similar concern in his answer. I suppose that this feature is useful to recommend similar terms, when users start a new search.

Yes, but it's also built into the results in most modern search engines. It will help you return better results. Think of the simple case where someone searches for "monkeys" but doesn't find an article named "Monkey". Although it seems obvious that these are not the same word, your users can easily be frustrated.

How could i run Java applications from Runrev, without asking users to install Java first?

You would have to find a way to bundle it with your app. The upside is, this would be much easier than trying to write something equivalent in Rev. You may very well be able to craft something that meets your needs, but in terms of performance and accuracy you'll have a nearly impossible time matching some of the more mature search engines out there.
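Brian's range compaction can be sketched like this in Python (illustrative only; `compact` is a hypothetical helper name):

```python
def compact(ids):
    """Collapse runs of consecutive IDs into 'start-end' ranges."""
    ids = sorted(ids)
    out, start, prev = [], ids[0], ids[0]
    for n in ids[1:]:
        if n == prev + 1:        # run continues
            prev = n
            continue
        out.append(str(start) if start == prev else "%d-%d" % (start, prev))
        start = prev = n
    out.append(str(start) if start == prev else "%d-%d" % (start, prev))
    return ",".join(out)

ids = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
print(compact(ids))  # 1934,2345-2352,2567,3578
```

Expanding a range back into IDs at search time is the reverse split on "-", so lookups stay cheap while common words with long consecutive runs shrink considerably.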
Re: Words Indexing strategies
On Tue, Feb 9, 2010 at 12:00 AM, Alejandro Tejada capellan2...@gmail.com wrote:
Now, I am looking for advice to create an index structure for searching for specific words inside the articles' text. I have been unable to implement a fast search algorithm, using multiple words, similar to Wikipedia's own search engine. Every idea or advice is welcome.

Can I just clarify your problem? You want to be able to search for phrases (partial sentences, possibly with boolean logic) inside the text stored in the xml nodes of the article, once the article is found in the index?

Bernard