Many thanks for replying to this question. :-)

On Wed, 10 Feb 2010 15:30:23 +0000 Bernard Devlin wrote:
> OK, so that's why you mention the different files for each letter of
> the alphabet.

Yes, each one of these 28 text files will be compressed in gz format. When users look for a word, or many words, only the relevant file(s) are decompressed and searched.

> I'm still a bit confused. Normally an index would indicate a location
> for an indexed term. That's what I assume your general index files are
> doing. What are the key terms like in this index, and what do they
> point to? Can you give us some examples?

Sure, look at the example that Brian Yennie wrote:

monkey:1,34,3827,21314

The numbers (after the word) are the names of compressed gz files. These files are located automatically or (if not found) manually, when the program starts.

> To my confused and befuddled mind, it sounds like you are wanting to
> create an index of the index. That can't be right :-)

No, this program uses only three kinds of indexes:
1) Articles
2) Article name redirections
3) Words

> I'm no expert in search algorithms. I have been hoping someone else
> would jump in who has done this kind of thing before.
> Are you wanting a pure, rev-only solution i.e. are you doing this to
> demonstrate what can be done using Rev alone?

Ideally, this should be a Rev-only solution, for cross-platform porting.

On Wed, 10 Feb 2010 11:17:08 -0500 Brian Yennie wrote:

> The first step for this would likely include creating an "inverted" index.
> This means you store something like: monkey:1,34,3827,2131
> Where the word being indexed is "monkey" and the numbers
> that follow are article IDs.
> Using this information it is pretty trivial to implement AND / OR.
> Just merge the article IDs as needed. You can use a mix of this inverted
> index and your regular index in combination to do other types of queries.
> You can even store word frequency information to find similar articles.
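As an aside, the AND / OR merge Brian describes can be sketched in a few lines. This is only an illustration in Python, not Rev, and it assumes the plain-text "word:id,id,..." line format from his example. The function names and the sample "baboon" line are hypothetical, made up for the sketch.

```python
def parse_index_line(line):
    """Split one 'word:id,id,...' line into (word, set of article IDs)."""
    word, _, ids = line.partition(":")
    return word.strip(), {int(n) for n in ids.split(",") if n.strip()}


def build_index(lines):
    """Build a dict mapping each word to the set of article IDs containing it."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, ids = parse_index_line(line)
        index.setdefault(word, set()).update(ids)
    return index


def search_and(index, words):
    """Article IDs that contain ALL of the given words."""
    sets = [index.get(w, set()) for w in words]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def search_or(index, words):
    """Article IDs that contain ANY of the given words."""
    return sorted(set().union(*(index.get(w, set()) for w in words)))


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real word index.
    sample = ["monkey:1,34,3827,2131", "baboon:34,2131,5000"]
    idx = build_index(sample)
    print(search_and(idx, ["monkey", "baboon"]))
    print(search_or(idx, ["monkey", "baboon"]))
```

A word that is missing from the index simply contributes an empty set, so an AND query that includes it returns no articles.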
Yes, this is correct and should work fine, but how could I write in the word index a range of articles where a word appears consecutively?

baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578

> With that said, I think you will have a very hard time crafting a competitive
> algorithm for this in Rev. A lot of the engines out there are very mature and
> there are many open source ones that you might consider bundling with
> your project instead of writing it yourself.

Actually, I am looking for good performance and cross-platform portability, so Rev is my first choice.

> Some things to watch out for:
> 1) File size. Indexing every word takes a lot of space, especially if your
> indices are in plain text and not a compact binary format.

How could I convert this index format into a compact binary format?

baboon:1934,2345,2346,2347,2348,2349,2350,2351,2352,2567,3578
monkey:1,34,3827,2131,3456,4567,5678,5789,6123,6234,6456

> 2) Common words. You might consider a "stop word" list, or a threshold.
> For example, if a word is in > 20% of all articles, don't index it.
> Or if it's in your "stop word" list.

Previously I believed that stop words should appear in all articles.

> 3) Root words / plurals. Can you detect that "monkeys"
> is the plural of "monkey" (or more complex cases)?

Richard wrote about a similar concern in his answer. I suppose that this feature is useful for recommending similar terms when users start a new search.

> An example of an open source engine that is very mature
> is Lucene, which can be run from the command-line in Java.

How could I run Java applications from Runrev without asking users to install Java first?
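To make the two index-format questions above more concrete (consecutive ranges and a compact binary format), here is one possible approach, sketched in Python rather than Rev. It is only an assumption about what could work, not something proposed in the thread: consecutive IDs are collapsed into "first-last" runs for the text format, and for the binary format the sorted IDs are stored as gaps between neighbours, each gap written as a variable-length integer (7 bits per byte). All function names are hypothetical.

```python
def collapse_ranges(ids):
    """Write sorted IDs as text, collapsing consecutive runs into 'first-last'."""
    ids = sorted(set(ids))
    parts = []
    i = 0
    while i < len(ids):
        j = i
        # Extend j while the next ID is exactly one more than the current one.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def expand_ranges(text):
    """Inverse of collapse_ranges: turn '1934,2345-2352' back into a list."""
    ids = []
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(part))
    return ids


def pack_ids(ids):
    """Binary form: gaps between sorted IDs, each gap as a 7-bit varint."""
    out = bytearray()
    prev = 0
    for n in sorted(set(ids)):
        gap = n - prev
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def unpack_ids(data):
    """Inverse of pack_ids: rebuild the sorted ID list from the packed bytes."""
    ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
    return ids


if __name__ == "__main__":
    baboon = [1934, 2345, 2346, 2347, 2348, 2349, 2350, 2351, 2352, 2567, 3578]
    print(collapse_ranges(baboon))
    print(len(pack_ids(baboon)), "bytes packed")
```

With this scheme a run of consecutive articles costs one byte per article in the binary form, because every gap inside the run is 1. Whether this is fast enough once ported to Rev would need to be measured.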
On Wed, 10 Feb 2010 08:26:30 -0800 Richard Gaskin wrote:

> The ambitious Alejandro Tejada wrote:

This is only a really modest search :-D

> A good place to start on that is the seminal paper describing the
> initial Google implementation, written by the founders:
>
> The Anatomy of a Large-Scale Hypertextual Web Search Engine
> Sergey Brin and Lawrence Page
> <http://infolab.stanford.edu/~backrub/google.html>
>
> But be warned: indexing is a deep topic, and may become a consuming
> passion. Roulette is rumored to have been invented by a monk who came
> to believe he could find a way to predict its outcomes, and eventually
> went mad trying. Indexing is a bit like that. :)

Many thanks for the pointers in this direction.

> A couple of the longer-term projects I work on need to incorporate good
> indexing of large corpuses, and my own code to that end has advanced
> only in small baby steps as I learn more about it.

[snip of really useful information]

> All that said, I have to wonder: if Wikipedia's content is available,
> isn't their index also available?

An article index is available, but a word index is not.

> Porting it from MySQL to SQLite seems a far less daunting task than
> writing an index from scratch.

SQLite is way over my head at the moment. :-)

Thanks again for answering this request!

Alejandro