This is slightly OT, but any solution I use will be mod_perl, of course.

I'm wondering how people deal with spiders.  I don't mind being spidered as
long as it's a well-behaved spider and follows robots.txt.  And at this
point I'm not concerned with the load spiders put on the server (and I know
there are modules for dealing with load issues).

But it's amazing how many are just lame in that they take perfectly good
HREF tags and mess them up in the request.  For example, every day I see
many requests from Novell's BorderManager where they forgot to convert HTML
entities in HREFs before making the request.

Here's another example:

64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740

In the last day that IP has requested about 10,000 documents.  Over half
were 404s: some were for unconverted HTML entities from HREFs, but most were
for documents that do not exist and have never existed on this site.  Almost
1,000 of the requests were 400s (Bad Request, like the example above).
And I'd guess that's not really the correct user agent, either....

In general, what I'm interested in is stopping the thousands of requests
for documents that just don't exist on the site, and simply blocking the
lame spiders, since they are, well, lame.

Anyway, what do you do with spiders like this, if anything?  Is it even an
issue that you deal with?

Do you use any automated methods to detect spiders, and perhaps block the
lame ones?  I wouldn't want to track every IP, but it seems like I could do
well just by looking at IPs that have a high proportion of 404s to 200s and
304s and that have been requesting over a long period of time, or very
frequently.
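
Something along these lines is what I have in mind -- just an untested
mod_perl 1.x sketch, where the package name, the thresholds, and the
per-child %status hash are all made up; a real version would want a shared
store (a dbm file, say) so every child sees the same counts:

package My::SpiderWatch;

use strict;
use Apache::Constants qw(OK);

# Per-child counts only -- a dbm file or similar shared store would be
# needed for the whole server to agree on who looks lame.
my %status;    # ip => { good => n, bad => n }

sub handler {
    my $r  = shift;
    my $ip = $r->connection->remote_ip;
    my $s  = $r->status;

    if    ($s == 200 || $s == 304) { $status{$ip}{good}++ }
    elsif ($s == 404 || $s == 400) { $status{$ip}{bad}++  }

    my $good = $status{$ip}{good} || 0;
    my $bad  = $status{$ip}{bad}  || 0;

    # Arbitrary thresholds: lots of requests, mostly errors.
    if ($good + $bad > 500 && $bad > 3 * $good) {
        $r->log_error("possible lame spider: $ip ($bad bad vs. $good good)");
        # ...add $ip to the block list the access handler checks...
    }

    return OK;
}

1;

It would be configured with a "PerlLogHandler My::SpiderWatch" line in
httpd.conf so it sees the final status of every request.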

The reason I'm asking is that I was asked about all the 404s in the web
usage reports.  I know I could post-process the logs before running the web
reports, but it would be much more fun to use mod_perl to catch and block
them on the fly.
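
The on-the-fly blocking half would just be an access handler that consults
whatever block list the log handler above fills in -- again untested and
with made-up names, and %blocked here is only a stand-in for that shared
list:

package My::SpiderBlock;

use strict;
use Apache::Constants qw(OK FORBIDDEN);

# Stand-in for the shared block list that My::SpiderWatch fills in.
my %blocked;

sub handler {
    my $r  = shift;
    my $ip = $r->connection->remote_ip;

    return FORBIDDEN if $blocked{$ip};    # 403 for anything on the list
    return OK;
}

1;

# httpd.conf:
#   PerlAccessHandler My::SpiderBlock
#   PerlLogHandler    My::SpiderWatch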

BTW -- I have blocked spiders on the fly before -- I used to have a decoy
in robots.txt that, if followed, would add that IP to the blocked list.  It
was interesting to see one spider get caught by that trick because it took
thousands and thousands of 403 errors before that spider got a clue that it
was blocked on every request.
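
The decoy amounted to something like the sketch below -- this is not the
code I actually ran, just a rough reconstruction with an invented path and
package name, and with the block-list update left as a comment.  The trap
URL is listed only in robots.txt, so anything that requests it is ignoring
(or deliberately mining) the Disallow lines:

package My::SpiderTrap;

use strict;
use Apache::Constants qw(FORBIDDEN);

sub handler {
    my $r  = shift;
    my $ip = $r->connection->remote_ip;

    $r->log_error("spider trap sprung by $ip");
    # ...add $ip to the shared block list here...

    return FORBIDDEN;
}

1;

# robots.txt -- the decoy path appears nowhere else on the site:
#   User-agent: *
#   Disallow: /trap/
#
# httpd.conf:
#   <Location /trap>
#       SetHandler perl-script
#       PerlHandler My::SpiderTrap
#   </Location>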

Thanks,


Bill Moseley
mailto:[EMAIL PROTECTED]
