At 04:33 PM 11/18/2001 +0100, you wrote:

>Hello,
>
>I am building my database for the spider to fill but I have a problem.
><SNIP>
>At first I thought about indexing only the words that seem relevant, but
>this way I can only make simple searches (e.g., "rabbit"). Then I thought
>about indexing each word together with the previous one and the next one.
>This way I should be able to make complex searches, even on more than 3
>words, since each new word can find the next one or the previous one, and
>so on, e.g.: the -> red -> rabbit -> with -> a -> big -> tail

Typically, this is handled by storing record number / word number
pairs in the index for each word.  So if you are looking for a
phrase, you simply look for words that have the same record number
and sequential word numbers.
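
As a rough illustration of the record number / word number idea, here is
a minimal Python sketch.  The function names (build_index, phrase_search)
and the sample records are made up for this example, and splitting on
whitespace is a simplifying assumption; a real engine would tokenize and
store its postings quite differently.

```python
from collections import defaultdict

def build_index(records):
    """Map each word to a list of (record number, word number) pairs."""
    index = defaultdict(list)
    for rec_num, text in enumerate(records):
        for word_num, word in enumerate(text.lower().split()):
            index[word].append((rec_num, word_num))
    return index

def phrase_search(index, phrase):
    """Return record numbers where the phrase words sit at sequential word numbers."""
    words = phrase.lower().split()
    if not words:
        return []
    # Every occurrence of the first word is a candidate phrase start.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        # Keep only starts where this word appears exactly `offset` words later.
        candidates = {(r, w) for (r, w) in candidates if (r, w + offset) in postings}
    return sorted({r for r, _ in candidates})

# Hypothetical sample records, numbered 0, 1, 2.
records = ["the red rabbit with a big tail", "a big red rabbit", "rabbit red"]
index = build_index(records)
print(phrase_search(index, "red rabbit"))  # [0, 1]
print(phrase_search(index, "big tail"))    # [0]
```

Record 2 contains both "red" and "rabbit" but in the wrong order, so it
is not returned for "red rabbit".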

>It seems quite a good way to do it but since I would like to avoid indexing
>"noise words" such as "the" or "a" it is not really satisfying.

If you index multiple words as you have suggested above, you are
going to have a _huge_ index size.  (You probably already will anyway.)
However, if you store the words as record num / word num pairs,
then you have a bit more flexibility to play around with.
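
One possible way to use that flexibility (my own assumption for
illustration, not the only approach) is to count noise words when
assigning word numbers but leave them out of the index.  A phrase query
then skips its noise words and checks that the remaining words sit at the
right distances from each other.  The names and the stop list below are
hypothetical.

```python
from collections import defaultdict

NOISE_WORDS = {"the", "a"}  # hypothetical stop list, for illustration only

def build_index(records):
    """Index non-noise words; word numbers still count the noise words."""
    index = defaultdict(set)
    for rec_num, text in enumerate(records):
        for word_num, word in enumerate(text.lower().split()):
            if word not in NOISE_WORDS:
                index[word].add((rec_num, word_num))
    return index

def phrase_search(index, phrase):
    """Match the non-noise query words at their original relative offsets."""
    terms = [(i, w) for i, w in enumerate(phrase.lower().split())
             if w not in NOISE_WORDS]
    if not terms:
        return []
    first_offset, first_word = terms[0]
    # Candidate phrase starts, derived from occurrences of the first indexed word.
    candidates = {(r, w - first_offset) for r, w in index.get(first_word, set())}
    for offset, word in terms[1:]:
        postings = index.get(word, set())
        candidates = {(r, s) for r, s in candidates if (r, s + offset) in postings}
    return sorted({r for r, _ in candidates})

# Hypothetical sample records, numbered 0 and 1.
records = ["the red rabbit with a big tail", "red rabbit big tail"]
index = build_index(records)
print(phrase_search(index, "rabbit with a big tail"))  # [0]
print(phrase_search(index, "the red rabbit"))          # [0, 1]
```

The trade-off in this sketch is that a noise word in the query acts as a
wildcard: "the red rabbit" also matches record 1, because "the" is never
checked against the index.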

Hope this helps,

-Art
-- 
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".
