Hi,

I had the same thing; some spiders are programmed VERY sloppily. I had a site
that responded to ANY request made to its location. The majority of spiders
don't cope with single versus double quotes, or with quotes left out of your
HREFs altogether. I also understand that mixing absolute href="/bla" and
relative href="../bla" links causes problems.

Those spiders would simply start requesting URLs like

  GET /foo/file=1243/date=12-30-2000/name=foobar'/foo/file=1243/date=12-30-2000/name=foobar
or
  GET ../bla'
or
  GET ../bla/'../bla'../bla'

and so on...

The site would then generate a page full of faulty links, which would also be
followed, because all the HREFs were built from the data in the requested URL.

Then other spiders picked up those faulty links from each other, and soon I was
getting more traffic from spiders trying to index faulty links than from regular
visitors. :)

What I did was check the input for a particular URL and see if it was correct
(should have done that in the first place), and then 404'ed the bastards.... I am
now redirecting them to the main page instead, which looks nicer in yer logs too.
Plus the spider might be tempted to spider yer page regularly. (Most spiders drop
redirects.) You could also just return a plain-text OK: lots of nice 200s in yer
stats...
Another solution I have seen is returning a doorway page to your site.
(Search-engine SPAM!) That's hitting them back where it hurts. :)
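
In mod_perl terms that check could look something like this (just a rough
sketch, not the code I actually run; the My::SpiderGuard package name and the
URL pattern are made up for the example):

    package My::SpiderGuard;
    use strict;
    use Apache::Constants qw(OK NOT_FOUND REDIRECT);

    # httpd.conf:  PerlAccessHandler My::SpiderGuard

    sub handler {
        my $r = shift;

        # Accept only the URL shapes this site actually generates
        # (hypothetical pattern).
        return OK if $r->uri eq '/'
                  or $r->uri =~ m{^/foo/file=\d+/date=[\d-]+/name=\w+$};

        # Option 1: just 404 them.
        # return NOT_FOUND;

        # Option 2: redirect the spider to the main page instead
        # (looks nicer in the logs, and most spiders drop redirects).
        $r->header_out(Location => '/');
        return REDIRECT;
    }
    1;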

I've raised this with the owners of those spiders (Excite/AltaVista), but I have
had no satisfactory response from them.

What we could do as a community is create spiderlawenforcement.org, a centralized
database where we keep track of spiders and how they index our sites. We could
index spiders by their Agent tag, record which ones follow robots.txt and which
ones explicitly exploit it, and blacklist some by IP if they keep breaking the
rules. Lots of developers could use this database to block those nasty sons
of.... er, well, sons of spiders I suppose. All open-sourced of course, with the
data available for free and some Perl modules to access the DB. Send an email to
the administrator of the spider every time it tries a bad link on a member site,
and watch how fast they'll fix the bl**dy things!
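
Something like the following is how a member site might consume that blacklist,
assuming it is mirrored locally as a plain text file with one offending
user-agent substring or IP address per line (the file name, format and package
name are all made up, since the service doesn't exist yet):

    package My::SpiderBlacklist;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    # Load the mirrored blacklist once, when the module is compiled.
    my %bad;
    if (open my $fh, '<', '/etc/spider-blacklist.txt') {
        chomp(my @lines = <$fh>);
        @bad{ grep { length } @lines } = ();
        close $fh;
    }

    # httpd.conf:  PerlAccessHandler My::SpiderBlacklist

    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent') || '';
        my $ip = $r->connection->remote_ip;

        # Block listed IPs outright, then match agent-tag substrings.
        return FORBIDDEN if exists $bad{$ip};
        for my $pattern (keys %bad) {
            return FORBIDDEN if index($ua, $pattern) >= 0;
        }
        return OK;
    }
    1;

A member site could also fire off the complaint mail from inside that handler
whenever it returns FORBIDDEN, so the spider's administrator hears about every
bad link.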

Let me know if any of you are interested in such a thing.



Bill Moseley wrote:

> This is slightly OT, but any solution I use will be mod_perl, of course.
>
> I'm wondering how people deal with spiders.  I don't mind being spidered as
> long as it's a well behaved spider and follows robots.txt.  And at this
> point I'm not concerned with the load spiders put on the server (and I know
> there are modules for dealing with load issues).
>
> But it's amazing how many are just lame in that they take perfectly good
> HREF tags and mess them up in the request.  For example, every day I see
> many requests from Novell's BorderManager where they forgot to convert HTML
> entities in HREFs before making the request.
>
> Here's another example:
>
> 64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
> 265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740
>
> In the last day that IP has requested about 10,000 documents.  Over half
> were 404 requests where some 404s were non-converted entities from HREFs,
> but most were just for documents that do not and have never existed on this
> site.  Almost 1000 requests were 400s (Bad Request, like the example above).
> And I'd guess that's not really the correct user agent, either....
>
> In general, what I'm interested in stopping are the thousands of requests
> for documents that just don't exist on the site.  And to simply block the
> lame ones, since they are, well, lame.
>
> Anyway, what do you do with spiders like this, if anything?  Is it even an
> issue that you deal with?
>
> Do you use any automated methods to detect spiders, and perhaps block the
> lame ones?  I wouldn't want to track every IP, but seems like I could do
> well just looking at IPs that have a high proportion of 404s to 200 and
> 304s and have been requesting over a long period of time, or very frequently.
>
> The reason I'm asking is that I was asked about all the 404s in the web
> usage reports.  I know I could post-process the logs before running the web
> reports, but it would be much more fun to use mod_perl to catch and block
> them on the fly.
>
> BTW -- I have blocked spiders on the fly before -- I used to have a decoy
> in robots.txt that, if followed, would add that IP to the blocked list.  It
> was interesting to see one spider get caught by that trick because it took
> thousands and thousands of 403 errors before that spider got a clue that it
> was blocked on every request.
>
> Thanks,
>
> Bill Moseley
> mailto:[EMAIL PROTECTED]

--
Yours sincerely,
Met vriendelijke groeten,


Marko van der Puil http://www.renesse.com
   [EMAIL PROTECTED]

