> Nick Arnett quotes Hastings Research:
>
> "Also, returning to the robots.txt standard: it may be underused simply
> because it is a security breach (the file openly lists URLs that webmasters
> do not want visible through search engines)."
Anyone who describes robots.txt as a security "breach" is misusing the term. At most, it is a minor "risk", as described in the security section of the robots.txt internet-draft from 1996:

   Web site administrators must realise this method is voluntary, and is
   not sufficient to guarantee some robots will not visit restricted
   parts of the URL space. Failure to use proper authentication or other
   restriction may result in exposure of restricted information. It is
   even possible that the occurrence of paths in the /robots.txt file
   may expose the existence of resources not otherwise linked to on the
   site, which may aid people guessing for URLs.

"Failure to use proper authentication ..." is the important part here. Denying access to those URLs is the job of the web server, not robots.txt.

As for the anti-thesaurus proposal, many search engines already provide something that does a similar job: you can mark sections of a document that should not be indexed. Usually, you want to do this for the top navigation, sidebars, ads, and copyright block. For example, Inktomi Enterprise Search uses <!--stopindex--> and <!--startindex--> to turn indexing off and on within a page. Other engines use different tags. Since several companies invented the same solution independently, this is probably a better fit to the underlying problem.

wunder
--
Walter Underwood
[EMAIL PROTECTED]
Senior Staff Engineer, Inktomi
http://www.inktomi.com/

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
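[A rough sketch of how an indexer might honor such markers. The marker names <!--stopindex--> and <!--startindex--> come from the message above; the function name and regex-based approach are illustrative assumptions, not Inktomi's actual implementation.]

```python
import re

# Match everything from a stopindex comment up to the next startindex
# comment, across line breaks (DOTALL), non-greedily so multiple
# excluded regions in one page are each handled separately.
STOP_RE = re.compile(r"<!--stopindex-->.*?<!--startindex-->", re.DOTALL)

def indexable_text(html):
    """Return page content with stopindex...startindex regions removed
    before the remaining text is fed to the indexer. Hypothetical helper,
    shown only to illustrate the marker convention."""
    return STOP_RE.sub("", html)

page = ("<body><!--stopindex--><div>top nav, ads</div><!--startindex-->"
        "<p>Real content worth indexing.</p></body>")
print(indexable_text(page))
# prints "<body><p>Real content worth indexing.</p></body>"
```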