At 04:33 PM 11/18/2001 +0100, you wrote:
Hello,
I am building my database for the spider to fill but I have a problem.
SNIP
At first I thought about indexing only the words that seem relevant, but
this way I can only make simple searches (e.g. rabbit). Then I thought
about indexing each word together with the previous one and the next one. This
way I should be able to make complex searches, even on more than 3 words, since
each new word can find the next or previous one, and so on. E.g.: the - red -
rabbit - with - a - big - tail
Typically, the way this is handled is by storing in the index
record number and word number pairs. So if you are looking
for a phrase, you simply look for words that share the same
record number and whose word numbers are sequential.
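
As a rough illustration of the record number / word number idea, here is a
minimal Python sketch. The function names (build_index, phrase_search), the
whitespace tokenizer, and the in-memory dictionary are all assumptions made
for the example; a real engine would store the pairs on disk and tokenize
more carefully.

```python
from collections import defaultdict

def build_index(records):
    """Map each word to a list of (record number, word number) pairs."""
    index = defaultdict(list)
    for rec_num, text in enumerate(records):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for word_num, word in enumerate(text.lower().split()):
            index[word].append((rec_num, word_num))
    return index

def phrase_search(index, phrase):
    """Return record numbers where the phrase words occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    # Every occurrence of the first word is a candidate phrase start.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        # Keep a start only if this word sits exactly `offset` words later
        # in the same record.
        candidates = {(r, w) for (r, w) in candidates
                      if (r, w + offset) in positions}
    return sorted({r for r, _ in candidates})

records = [
    "the red rabbit with a big tail",
    "a big red dog",
    "rabbit red the",
]
index = build_index(records)
matches = phrase_search(index, "red rabbit")
```

With these three sample records, "red rabbit" should match only record 0:
record 2 contains both words, but not at sequential word numbers.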
It seems quite a good way to do it, but since I would like to avoid indexing
noise words such as "the" or "a", it is not really satisfying.
If you index multiple words as you have suggested above, you are
going to have a _huge_ index size. (You probably already will anyway.)
However, if you store the words as record num / word num pairs,
then you have a bit more flexibility to play around with.
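
One possible reading of that flexibility (my assumption, not something spelled
out above) is that noise words can be left out of the index while still being
counted when word numbers are assigned. A phrase query can then compare the
gaps between the words that were kept. The sketch below uses a made-up noise
list and hypothetical function names.

```python
from collections import defaultdict

NOISE_WORDS = {"the", "a", "with"}  # hypothetical noise-word list

def build_sparse_index(records, noise=NOISE_WORDS):
    """Index only non-noise words, but number every word in the record."""
    index = defaultdict(list)
    for rec_num, text in enumerate(records):
        for word_num, word in enumerate(text.lower().split()):
            if word not in noise:
                index[word].append((rec_num, word_num))
    return index

def sparse_phrase_search(index, phrase, noise=NOISE_WORDS):
    """Match a phrase by the word-number gaps between its non-noise words."""
    kept = [(off, w) for off, w in enumerate(phrase.lower().split())
            if w not in noise]
    if not kept:
        return []  # a query made only of noise words cannot be answered
    first_off, first_word = kept[0]
    candidates = set(index.get(first_word, []))
    for off, word in kept[1:]:
        positions = set(index.get(word, []))
        gap = off - first_off  # distance from the first kept word
        candidates = {(r, w) for (r, w) in candidates
                      if (r, w + gap) in positions}
    return sorted({r for r, _ in candidates})

records = [
    "the red rabbit with a big tail",
    "red big rabbit tail",
]
sparse_index = build_sparse_index(records)
hits = sparse_phrase_search(sparse_index, "rabbit with a big tail")
```

The trade-off in this sketch is that a skipped noise word behaves like a
wildcard: the query above would also match "rabbit on my big tail", because
only the gaps between the kept words are checked.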
Hope this helps,
-Art
--
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.