Are your 404 errors part of a Denial of Service attack?  That would be a
good reason to set up a blocking mechanism.  Though, as others have pointed
out, Apache handles 404 errors very efficiently, so you should make sure that
this is really an issue.  Measure bandwidth, disk and CPU resources used for
servicing 404 pages as a fraction of all requests.  If it is the 404 errors
that are causing you problems, then certainly block sites that are causing
them.
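
If you want a rough way to answer that question, here is a sketch (in
Python) that tallies 404s as a fraction of requests and bytes in an Apache
access log, broken down by client IP.  The log path and the "combined" log
format are assumptions; adjust both for your setup.

    import re
    from collections import Counter

    LOG = "/var/log/apache2/access.log"   # assumption: adjust for your server
    # combined format: host ident user [time] "request" status bytes "ref" "agent"
    pattern = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)')

    total_reqs = total_bytes = 0
    nf_reqs = nf_bytes = 0
    nf_by_ip = Counter()

    with open(LOG) as f:
        for line in f:
            m = pattern.match(line)
            if not m:
                continue
            ip, status, size = m.groups()
            nbytes = int(size) if size.isdigit() else 0   # size is "-" sometimes
            total_reqs += 1
            total_bytes += nbytes
            if status == "404":
                nf_reqs += 1
                nf_bytes += nbytes
                nf_by_ip[ip] += 1

    print("404s: %d of %d requests (%.1f%%), %d of %d bytes (%.1f%%)" % (
        nf_reqs, total_reqs, 100.0 * nf_reqs / max(total_reqs, 1),
        nf_bytes, total_bytes, 100.0 * nf_bytes / max(total_bytes, 1)))
    print("top 404 generators:", nf_by_ip.most_common(5))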

Personally, I find most 404 errors on my web sites are due to broken links.
whatexit.org has plenty of them right now (d'oh!).

Real spiders don't guess random URLs to crawl the web.  That would be a
waste of time.  In a world with billions (trillions?) of web sites, spiders
aren't looking to make more work for themselves.  They look for links on
pages and crawl those links.  They do re-crawl pages that they've seen
before to look for updates, so if you remove a web page there is a good
chance that you will see occasional 404 errors for it.  That's a good
thing.  You want the spiders to see that it has really gone away so that they
stop listing it in their search results.

There are two ways to directly influence spiders:

Negative hints: A robots.txt file (http://www.robotstxt.org) can be used to
indicate which URLs you don't want crawled.  All major web search engines
obey robots.txt, and the ones that don't obey it quickly get banned or
realize that it is in their best interest to pay attention to robots.txt.
This is particularly important for "infinite" web sites (for example, a
calendar with a "next month" button that can be clicked until the year
9999; you probably don't want a spider clicking "next month" for days on
end, and the spiders have better sites to search anyway).
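
For the calendar case, a couple of lines in robots.txt are usually all it
takes.  A minimal sketch, using Python's standard robots.txt parser just to
sanity-check the rules; the /calendar/ path and hostname are made-up
examples:

    import urllib.robotparser

    rules = [
        "User-agent: *",
        "Disallow: /calendar/",   # keep spiders out of the endless calendar
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("*", "http://www.example.com/index.html"))        # True
    print(rp.can_fetch("*", "http://www.example.com/calendar/9999-12"))  # False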

Positive hints:  There is a standard called "XML Sitemaps" (
http://www.xml-sitemaps.com/) that lets you publish a list of the URLs on
your site so that spiders can crawl it more efficiently.  More importantly,
search engines use the sitemap to display more info to your users.  (That's
why http://www.google.com/search?q=robots.txt shows the menu structure of
robotstxt.org right in the search results.)
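
A sitemap is just an XML file that lists your URLs.  Here is a minimal
sketch that writes one with Python's standard library; the URLs and change
frequencies are made up, and in practice you would generate the list from
your CMS or site database, then point the search engines at it (e.g. with a
"Sitemap:" line in robots.txt).

    import xml.etree.ElementTree as ET

    # made-up URLs; generate these from your real site
    urls = [
        ("http://www.example.com/", "daily"),
        ("http://www.example.com/about.html", "monthly"),
    ]

    NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, changefreq in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8",
                                 xml_declaration=True)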

There is one time that spiders *do* request non-existent pages on purpose.
They're testing to make sure your 404 page mechanism is working.  Some web
servers do something stupid with non-existent pages... like redirecting the
user to a non-error page.  If you look at a custom 404 page like
http://www.gocomics.com/does-not-exist you'll see that even though the page
is bright, colorful, and even useful, it still returns an HTTP 404 error at
the protocol level.  Some sites serve a page like that but return a
non-error code in the HTTP protocol (a "soft 404").  To a spider this can be
as dangerous as the "infinite calendar" example above.
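
It is worth testing your own site the same way the spiders do.  A quick
sketch that requests a made-up URL and reports what the server actually
says at the protocol level; the hostname is a placeholder for your own
site:

    import urllib.request, urllib.error

    url = "http://www.example.com/this-page-should-not-exist"
    try:
        resp = urllib.request.urlopen(url)
        # Getting here means the server claimed success -- a "soft 404".
        print("PROBLEM: got HTTP", resp.getcode(), "for a nonexistent page")
    except urllib.error.HTTPError as e:
        print("Good: server returned HTTP", e.code)   # expect 404 (or 410)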

You also brought up the issue of brute force password attacks.  Your
password system shouldn't give a 404 error when the user has entered the
wrong password.  401 or other errors are more appropriate (
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).  I agree that
blocking that kind of attack is a good thing, but that should be unrelated
to 404s.  (And if you try entering the wrong password on
http://login.yahoo.com, you'll see that you don't get any HTTP error... the
error is part of the UI, not the protocol.)
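
If you are writing the login handler yourself, sending the right code is
trivial.  A toy sketch (Python's http.server, made-up credentials, no TLS)
just to show 401-on-bad-password rather than 404; a real login system would
of course sit behind a proper auth framework:

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    USER, PASSWORD = "alice", "secret"    # made-up credentials for the demo

    class LoginHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            auth = self.headers.get("Authorization", "")
            good = "Basic " + base64.b64encode(
                ("%s:%s" % (USER, PASSWORD)).encode()).decode()
            if auth == good:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"welcome\n")
            else:
                # wrong or missing password: 401, not 404
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="example"')
                self.end_headers()
                self.wfile.write(b"authentication required\n")

    HTTPServer(("localhost", 8080), LoginHandler).serve_forever()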

Not all 404s are bad.  A user accidentally mistyping a URL should get a 404
error and not be punished.  Maybe their finger slipped, eh?  Or maybe you
have a broken link on your web site and they are clicking "reload" thinking
that this will find it.  Therefore, make sure that if you block users that
generate 404 errors, it is because they are using a seriously large amount of
resources, not because they have bad typing skills.
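
One way to honor that is a sliding-window threshold: only flag a client
once it has generated far more 404s than any human could by fat-fingering
URLs.  A sketch; the window and threshold are pure guesses to tune for
your site:

    import time
    from collections import defaultdict, deque

    WINDOW = 600        # seconds
    THRESHOLD = 100     # 404s per window before we even consider blocking

    recent_404s = defaultdict(deque)   # ip -> timestamps of recent 404s

    def record_404(ip, now=None):
        """Record one 404 for ip; return True if ip crossed the threshold."""
        now = time.time() if now is None else now
        q = recent_404s[ip]
        q.append(now)
        while q and q[0] < now - WINDOW:   # drop events outside the window
            q.popleft()
        return len(q) > THRESHOLD

    # Feed this from a log tail, e.g.:
    #   if record_404(client_ip):
    #       print("candidate for blocking:", client_ip)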

Tom