> Nick Arnett quotes Hastings Research:
>
> "Also, returning to the robots.txt standard: it may be underused simply
> because it is a security breach (the file openly lists URLs that webmasters
> do not want visible through search engines)."
Anyone who describes robots.txt as a security "breach" is misusing the term. At most, it is a minor "risk", as described in the security section of the robots.txt internet-draft from 1996:

   Web site administrators must realise this method is voluntary, and is
   not sufficient to guarantee some robots will not visit restricted
   parts of the URL space. Failure to use proper authentication or other
   restriction may result in exposure of restricted information. It is
   even possible that the occurrence of paths in the /robots.txt file
   may expose the existence of resources not otherwise linked to on the
   site, which may aid people guessing for URLs.

"Failure to use proper authentication ..." is the important part here. Denying access to those URLs is the job of the web server, not robots.txt.

As for the anti-thesaurus proposal, many search engines already provide something that does a similar job: you can mark sections of a document that should not be indexed. Usually, you want to do this for the top navigation, sidebars, ads, and copyright block. For example, Inktomi Enterprise Search uses <!--stopindex--> and <!--startindex--> to turn indexing off and on within a page. Other engines use different tags. Since several companies invented the same solution independently, this is probably a better fit to the underlying problem.

wunder
--
Walter Underwood
[EMAIL PROTECTED]
Senior Staff Engineer, Inktomi
http://www.inktomi.com/

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
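[A rough sketch of how an indexer might honor such markers. The marker names <!--stopindex--> and <!--startindex--> come from the message above; the function name and regex-based approach are illustrative assumptions, not Inktomi's actual implementation.]

```python
import re

# Match everything from a stopindex comment up to the next startindex
# comment, across line breaks (DOTALL), non-greedily so multiple
# excluded regions in one page are each handled separately.
STOP_RE = re.compile(r"<!--stopindex-->.*?<!--startindex-->", re.DOTALL)

def indexable_text(html):
    """Return page content with stopindex...startindex regions removed
    before the remaining text is fed to the indexer. Hypothetical helper,
    shown only to illustrate the marker convention."""
    return STOP_RE.sub("", html)

page = ("<body><!--stopindex--><div>top nav, ads</div><!--startindex-->"
        "<p>Real content worth indexing.</p></body>")
print(indexable_text(page))
# prints "<body><p>Real content worth indexing.</p></body>"
```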