[Robots] Re: Indexed keywords

2001-11-20 Thread Art Pollard


At 04:33 PM 11/18/2001 +0100, you wrote:

Hello,

I am building my database for the spider to fill but I have a problem.
SNIP
At first I thought about indexing only the words that seem relevant, but
this way I can only make simple searches (e.g., rabbit). Then I thought
about indexing each word together with the previous one and the next one.
This way I should be able to make complex searches, even on more than 3
words, since each new word can find the next or previous one, and so on.
E.g.: the - red - rabbit - with - a - big - tail

Typically, this is handled by storing record number / word number
pairs in the index.  So if you are looking for a phrase, you simply
look for words that share a record number and whose word numbers
are sequential.
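
A minimal sketch of the record number / word number idea, in Python. The
function names (build_index, phrase_search), the whitespace tokenizer, and
the sample records are assumptions made for illustration; they do not come
from any particular retrieval engine.

```python
from collections import defaultdict

def build_index(records):
    """Map each word to a list of (record number, word number) pairs."""
    index = defaultdict(list)
    for record_num, text in enumerate(records):
        for word_num, word in enumerate(text.lower().split()):
            index[word].append((record_num, word_num))
    return index

def phrase_search(index, phrase):
    """Return record numbers where the phrase words sit at sequential word numbers."""
    words = phrase.lower().split()
    if not words:
        return []
    # Candidate starting points: every occurrence of the first word.
    candidates = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        # Keep a start only if this word occurs exactly `offset` words later
        # in the same record.
        candidates = {(rec, pos) for rec, pos in candidates
                      if (rec, pos + offset) in positions}
    return sorted({rec for rec, _ in candidates})

# Hypothetical sample data.
records = [
    "the red rabbit with a big tail",
    "a big red rabbit",
    "the rabbit is red",
]
index = build_index(records)
print(phrase_search(index, "red rabbit"))  # [0, 1]
print(phrase_search(index, "big tail"))    # [0]
```

The third record contains both "red" and "rabbit" but is not returned,
because their word numbers are not sequential in that order.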

It seems quite a good way to do it, but since I would like to avoid indexing
noise words such as "the" or "a", it is not really satisfying.

If you index multiple words as you have suggested above, you are
going to have a _huge_ index size.  (You probably already will anyway.)
However, if you store the words as record num / word num pairs,
then you have a bit more flexibility to play around with.
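
One way that flexibility might be used, sketched in Python: number the
words before dropping the noise words, so the surviving entries keep their
original word numbers and a phrase query can still line them up across the
gaps. The noise-word list, function names, and sample records are
hypothetical, and this is only one possible arrangement.

```python
from collections import defaultdict

NOISE_WORDS = {"the", "a"}  # assumed list, for illustration only

def build_index(records):
    """Index (record number, word number) pairs, skipping noise words.

    Word numbers are assigned before filtering, so gaps left by noise
    words are preserved.
    """
    index = defaultdict(set)
    for record_num, text in enumerate(records):
        for word_num, word in enumerate(text.lower().split()):
            if word not in NOISE_WORDS:
                index[word].add((record_num, word_num))
    return index

def phrase_search(index, phrase):
    """Match the non-noise words of the phrase at their relative offsets."""
    terms = [(off, w) for off, w in enumerate(phrase.lower().split())
             if w not in NOISE_WORDS]
    if not terms:
        return []
    first_off, first_word = terms[0]
    # Each anchor is (record, word number where the phrase would begin).
    anchors = {(rec, pos - first_off)
               for rec, pos in index.get(first_word, set())}
    for off, word in terms[1:]:
        positions = index.get(word, set())
        anchors = {(rec, start) for rec, start in anchors
                   if (rec, start + off) in positions}
    return sorted({rec for rec, _ in anchors})

# Hypothetical sample data.
records = [
    "the red rabbit with a big tail",
    "red rabbit and big tail",
]
index = build_index(records)
print(phrase_search(index, "rabbit with a big tail"))  # [0]
print(phrase_search(index, "a big tail"))              # [0, 1]
```

The trade-off is that the positions held by noise words are never checked,
so any word in one of those slots counts as a match, and leading noise
words in a query are effectively ignored.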

Hope this helps,

-Art
-- 
Art Pollard
http://www.lextek.com/
Suppliers of High Performance Text Retrieval Engines.


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].




[Robots] Anti-thesaurus proposal

2001-11-20 Thread Nick Arnett


http://www.hastingsresearch.com/net/06-anti-thesaurus.shtml

This is a proposal for a meta-tag to tell search engines to ignore certain
words on a page when scoring relevancy.  Among other things, it mentions
robots.txt as problematic:

Also, returning to the robots.txt standard: it may be underused simply
because it is a security breach (the file openly lists URLs that webmasters
do not want visible through search engines). It is possible that many more
webmasters would be using it properly, if not for that security problem.

My opinion is that this is enormously impractical, but perhaps there's the
seed of a good idea in it.  However, it seems to me that if the authors of a
page would actually bother to create meta-tags to increase search
efficiency, it would be much easier (semi-automated, even) to create a tag
containing the *most* relevant words, not the least.
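
To illustrate the "semi-automated" point, here is a rough Python sketch
that builds a keywords meta-tag from the most frequent words on a page.
Raw word frequency is only a crude stand-in for relevance, and the
function name, noise-word list, and limit parameter are assumptions made
for this example, not part of the proposal.

```python
import re
from collections import Counter
from html import escape

NOISE_WORDS = {"the", "a", "an", "and", "of", "to", "is", "with"}  # assumed list

def keywords_meta_tag(text, limit=5):
    """Build a keywords meta-tag from the most frequent non-noise words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in NOISE_WORDS)
    top = [word for word, _ in counts.most_common(limit)]
    return '<meta name="keywords" content="{}">'.format(escape(", ".join(top)))

# Hypothetical page text.
page_text = "The rabbit and the rabbit: a rabbit with a big tail, a tail to see."
print(keywords_meta_tag(page_text, limit=2))
# <meta name="keywords" content="rabbit, tail">
```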

Nick Arnett
Phone/fax: 408-904-7198

