Guys

We also have a problem with evil clients. It's not always spiders... in fact 
more often than not it's some smart-ass with a customised perl script 
designed to screen-scrape all our data (usually to get email addresses for 
spam purposes).

Our solution, which works pretty well, is to have a LogHandler that checks the 
IP address of each incoming request and stores some information in the DB about 
that client: when it was last seen, how many requests it's made in the past n 
seconds, etc.  It means a DB hit on every request, but it's pretty light, all 
things considered.
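
In case it helps, the log handler is roughly something like this (a mod_perl 
1.x sketch only -- the package name, DSN and table are just placeholders, and 
in real life you'd want Apache::DBI so you're not reconnecting on every hit):

    package My::ThrottleLog;
    use strict;
    use Apache::Constants qw(OK);
    use DBI;

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;

        # Placeholder schema: CREATE TABLE client_hits (ip VARCHAR(15), hit_time INT)
        my $dbh = DBI->connect('dbi:mysql:throttle', 'user', 'pass',
                               { RaiseError => 1 });
        $dbh->do('INSERT INTO client_hits (ip, hit_time) VALUES (?, ?)',
                 undef, $ip, time());
        $dbh->disconnect;

        return OK;
    }
    1;

and then in httpd.conf:

    PerlLogHandler My::ThrottleLog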

We then have an external process that wakes up every minute or so and checks 
the DB for badly-behaved clients.  If it finds any, we get an email and the 
offending IPs are written into a file that is read by mod_rewrite, which sends 
bad clients to, well, wherever... http://www.microsoft.com is a good one :-)
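
The watcher is basically a cron-style script along these lines (same 
placeholder table as above; the threshold and file path are made up):

    #!/usr/bin/perl
    use strict;
    use DBI;

    # Flag anyone with more than 300 hits in the last 60 seconds.
    my $dbh = DBI->connect('dbi:mysql:throttle', 'user', 'pass',
                           { RaiseError => 1 });
    my $bad = $dbh->selectcol_arrayref(
        'SELECT ip FROM client_hits WHERE hit_time > ?
         GROUP BY ip HAVING COUNT(*) > 300',
        undef, time() - 60);
    $dbh->disconnect;

    if (@$bad) {
        open my $map, '>>', '/etc/apache/bad_clients.txt' or die $!;
        print $map "$_ deny\n" for @$bad;
        close $map;
        # ...fire off the notification email here...
    }

and the mod_rewrite side is just a txt RewriteMap keyed on the client IP, 
something like:

    RewriteEngine On
    RewriteMap  bad_clients  txt:/etc/apache/bad_clients.txt
    RewriteCond ${bad_clients:%{REMOTE_ADDR}|ok}  =deny
    RewriteRule .*  http://www.microsoft.com/  [R,L]

mod_rewrite notices when the map file's mtime changes, so newly-flagged IPs 
get picked up without a restart.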

It works great.  Of course, mod_throttle sounds pretty cool and maybe I'll 
test it out on our servers.  There are definitely more ways to do this...

Which reminds me, you HAVE to make sure that your Apache children are 
size-limited and that you have a MaxClients setting where MaxClients * SizeLimit < 
Free Memory.  If you don't, and you get slammed by one of these wankers, your 
server will swap and then you'll lose all the benefits of shared memory that 
Apache and mod_perl offer us.  Check out the thread that was all over the 
list about a month ago for more information.  Basically, avoid swapping at 
ALL costs.
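
One way to do that (not necessarily exactly what we run -- the numbers are 
examples only, and check the Apache::SizeLimit docs for the exact handler 
phase) is Apache::SizeLimit plus a conservative MaxClients:

    # startup.pl (or a <Perl> section) -- sizes are in KB
    use Apache::SizeLimit;
    $Apache::SizeLimit::MAX_PROCESS_SIZE = 20000;   # ~20MB per child

    # httpd.conf
    PerlCleanupHandler Apache::SizeLimit
    MaxClients 25    # 25 children * ~20MB each stays under the box's free RAM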


Kyle Dawkins
Central Park Software

On Friday 19 April 2002 08:55, Marc Slagle wrote:
> We never tried mod_throttle, it might be the best solution.  Also, one
> thing to keep in mind is that some search engines will come from multiple
> IP addresses/user-agents at once, making them more difficult to stop.
