There's no guarantee that crawlers will be polite and honor robots.txt directives; the search-engine ones probably do, but the spammers' ones definitely don't, and in fact probably pay special attention to whatever is excluded. (I keep a honeypot entry in my robots.txt designed to catch, and then block, the malicious robots.) OTOH, since the User-Agent data is only as reliable as the intent of whoever set the crawler up, filtering on that may not be much help either.

I seem to recall reading somewhere that it's possible to configure Apache to recognize "executables" independently of the OS's file extensions and associations. If that's true, perhaps it could lead to a solution to your problem.
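For what it's worth, here's a minimal sketch of the honeypot idea (the /bot-trap/ path and the log file name are made up for illustration; a compliant crawler never requests an excluded path, so anything that does is a candidate for blocking):

    # robots.txt -- exclude a path that nothing legitimate links to
    User-agent: *
    Disallow: /bot-trap/

    # httpd.conf -- log whoever requests the trap anyway; the actual
    # blocking still needs an external script or firewall rule fed
    # from this log
    SetEnvIf Request_URI "^/bot-trap/" bot_trap
    CustomLog logs/bot-trap_log combined env=bot_trap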

Mark

-------- Original Message  --------
Subject: [EMAIL PROTECTED] Blocking crawling of CGIs
From: Tony Rice (trice) <[EMAIL PROTECTED]>
To: users@httpd.apache.org
Date: Tuesday, September 18, 2007 11:24:20 AM

We've had some instances where crawlers stumbled onto a CGI script
that refers to itself and started pounding the server with requests to
that CGI.

There are so many CGI scripts on this server that I don't want to
maintain a huge robots.txt file.  Any suggestions for other techniques to
keep crawlers away from CGI scripts?  Match the User-Agent with
BrowserMatch and then do something creative with "Deny from env="?
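Something along these lines, maybe? (The User-Agent patterns and the
cgi-bin path below are just placeholders, and this is untested Apache
2.2 Order/Deny syntax.)

    # Flag anything that self-identifies as a crawler, then refuse it
    # access to the whole CGI directory in one place.
    BrowserMatchNoCase "crawler|spider|robot|slurp" is_a_bot
    <Directory "/usr/local/apache2/cgi-bin">
        Order Allow,Deny
        Allow from all
        Deny from env=is_a_bot
    </Directory>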


