Re: Throttling, once again
> "Christian" == Christian Gilmore <[EMAIL PROTECTED]> writes: Christian> Hi, Drew. >> I came across the very problem you're having. I use mod_bandwidth, its >> actively maintained, allows via IP, directory or any number of ways to >> monitor bandwidth usage http://www.cohprog.com/mod_bandwidth.html Christian> The size of the data sent through the pipe doesn't reflect the CPU spent to Christian> produce that data. mod_bandwidth probably doesn't apply in the current Christian> scenario being discussed. which is why I wrote Stonehenge::Throttle. -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
RE: Throttling, once again
Hi, Drew.

> I came across the very problem you're having. I use mod_bandwidth; it's
> actively maintained and can monitor bandwidth usage by IP, by directory,
> or in any number of other ways:
> http://www.cohprog.com/mod_bandwidth.html

The size of the data sent through the pipe doesn't reflect the CPU spent
to produce that data. mod_bandwidth probably doesn't apply in the current
scenario being discussed.

Thanks,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group
RE: Throttling, once again
Hi, Jeremy.

> I looked at the page you mentioned below. It wasn't really clear on the
> page, but what happens when the requests get above the max allowed? Are
> the remaining requests queued or are they simply given some kind of
> error message?

The service will respond with an HTTP 503 message when the MaxConcurrentReqs
number is reached. That tells the browser that the service is temporarily
unavailable and to try again later.

> There seem to be a number of different modules for this kind of thing,
> but most of them seem to be fairly old. We could use a more current
> throttling module that combines what others have come up with.

Age shouldn't matter. If something works as designed, it doesn't need to be
updated. :)

> For example, the snert.com mod_throttle is nice because it does it based
> on IP - but it does it site wide in that mode. This mod_throttle_access
> seems nice because it can be set for an individual URI... but that's a
> pain for sites like mine that have 50 or more intensive scripts (by
> directory would be nice). And still, both of these approaches don't use
> cookies like some of the others do to make sure that legit proxies aren't
> blocked.

Well, the design goals of each are probably different. For instance,
mod_throttle_access was designed to keep a service healthy, not to punish a
set of over-zealous users. Blocking by IP doesn't necessarily protect the
health of your service. Also, you shouldn't rely on cookies to ensure the
health of your service: if someone has cookies disabled, they can defeat
your scheme.

BTW, mod_throttle_access is a per-directory module (ie, it works inside
<Directory>, <Location>, or <Files> containers), so you can protect an
entire tree at once. It will just count that entire tree as one unit during
its count toward MaxConcurrentReqs.

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group
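As a rough illustration of the per-directory setup described above, an
httpd.conf fragment might look like the sketch below. The directive name is
assumed to match the MaxConcurrentReqs setting named in the message, and
the path and limit are invented for illustration only.

    # Hypothetical fragment: protect one CGI tree with mod_throttle_access.
    # Directive name assumed from the MaxConcurrentReqs setting above;
    # the path and limit are examples, not recommendations.
    <IfModule mod_throttle_access.c>
        <Location /cgi-bin/search>
            MaxConcurrentReqs 10
        </Location>
    </IfModule>

With something like this in place, once ten requests to anything under
/cgi-bin/search are in flight, further requests receive the 503 response
described above until one of the running requests completes.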
Re: Throttling, once again
You're assuming that the spider respects cookies. I would not expect this
to be the case.

Peter Bi wrote:
>
> How about adding an MD5 watermark for the cookie? Well, it is becoming
> complicated.
>
> Peter Bi

-- 
Steve Piner
Web Applications Developer
Marketview Limited
http://www.marketview.co.nz
Re: Throttling, once again
How about adding an MD5 watermark for the cookie? Well, it is becoming
complicated.

Peter Bi

- Original Message -
From: "kyle dawkins" <[EMAIL PROTECTED]>
To: "Peter Bi" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 8:29 AM
Subject: Re: Throttling, once again

> Peter
>
> Storing the last access time, etc. in a cookie won't work for a perl
> script that's abusing your site, or pretty much any spider, or even for
> anyone browsing without cookies, for that matter.
RE: Throttling, once again
I came across the very problem you're having. I use mod_bandwidth; it's
actively maintained and can monitor bandwidth usage by IP, by directory,
or in any number of other ways:

http://www.cohprog.com/mod_bandwidth.html

Although it's not mod_perl related, I hope that this helps.

Drew
RE: Throttling, once again
Hi,

I looked at the page you mentioned below. It wasn't really clear on the
page, but what happens when the requests get above the max allowed? Are the
remaining requests queued, or are they simply given some kind of error
message?

There seem to be a number of different modules for this kind of thing, but
most of them seem to be fairly old. We could use a more current throttling
module that combines what others have come up with.

For example, the snert.com mod_throttle is nice because it does it based on
IP - but it does it site wide in that mode. This mod_throttle_access seems
nice because it can be set for an individual URI... but that's a pain for
sites like mine that have 50 or more intensive scripts (by directory would
be nice). And still, both of these approaches don't use cookies like some
of the others do to make sure that legit proxies aren't blocked.

Jeremy
RE: Throttling, once again
On 19-Apr-2002 Bill Moseley wrote:
> Also, does anyone have suggestions for testing once throttling is in
> place? I don't want to start cutting off the good customers, but I do
> want to get an idea how it acts under load. ab to the rescue, I suppose.

wget supports recursive spidering. Or try IE or another browser that will
save an entire site to disk. The scripts that the cherry pickers use to
garner email addresses are probably all over the net too...

    for i in `seq 1 10`; do
        cd /tmp
        mkdir wget$i
        cd wget$i
        wget -rb http://localhost/wiki
    done

Load on my server immediately jumped from < 1 to 12 when I did this.

Wim
Re: Throttling, once again
Hi Bill,

> Wasn't there just a thread on throttling a few weeks ago?

There have been many. Here's my answer to one of them:

http://mathforum.org/epigone/modperl/blexblolgang/004101c0f2cc$9d14a540$[EMAIL PROTECTED]

> Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
> ago.

What we did was take that module, replace the CPU monitoring stuff with
simple hit counting, and make it just return an "Access Denied" when the
limit is hit. We also did some slightly fancier stuff than just blocking by
IP, since we were concerned about proxy servers getting blocked. You may
not need to bother with that, though.

One nice thing about Randal's module is that it uses disk (not shared
memory) and worked well across a cluster sharing the directory over NFS.

> Also, does anyone have suggestions for testing once throttling is in
> place? I don't want to start cutting off the good customers, but I do
> want to get an idea how it acts under load. ab to the rescue, I suppose.

That will work, as will http_load or httperf.

- Perrin
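For readers who want to see the shape of that approach, here is a minimal
mod_perl 1.x sketch of per-IP hit counting on disk that denies access over
a limit. It is not Perrin's actual code or Stonehenge::Throttle itself; the
package name, directory, and limits are invented for illustration.

    package My::HitThrottle;    # illustrative name, not Stonehenge::Throttle
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my $dir    = "/var/throttle";   # shared directory (could be NFS-mounted)
    my $limit  = 60;                # max hits per IP per window
    my $window = 60;                # window length, in seconds

    sub handler {
        my $r = shift;
        return OK unless $r->is_initial_req;

        my $ip   = $r->connection->remote_ip;
        my $slot = int(time() / $window);        # current time bucket
        my $file = "$dir/$ip.$slot";

        my $fh;
        unless (open($fh, "+<$file")) {
            # first hit in this window: create the count file
            open($fh, "+>$file") or return OK;   # fail open, don't punish users
        }
        flock $fh, 2;                            # LOCK_EX
        my $hits = <$fh> || 0;
        $hits++;
        seek $fh, 0, 0;
        truncate $fh, 0;
        print $fh "$hits\n";
        close $fh;

        # the "Access Denied" part: refuse once this IP passes the limit
        return FORBIDDEN if $hits > $limit;
        return OK;
    }
    1;

Wired up with a PerlAccessHandler My::HitThrottle line inside the location
you want to protect; an external cron job can sweep old count files out of
the directory.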
RE: Throttling, once again
Bill,

If you're looking to throttle access to a particular URI (or set of URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group

-----Original Message-----
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 12:56 AM
To: [EMAIL PROTECTED]
Subject: Throttling, once again

Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.

Load average was over 90 on a dual-CPU Enterprise 3500 running Solaris 2.6.
It's a mod_perl server, but it also handles a few CGI scripts, and the
spider was hitting one of the CGI scripts over and over. They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl. And under
normal peak loads RAM is not a problem.

The machine also has a bandwidth limitation (a packet shaper is used to
share the bandwidth). That combined with the spider didn't help things.
Luckily there's 4GB, so even at a load average of 90 it wasn't really
swapping much. (Well, not when I caught it, anyway.) This spider was using
the same IP for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago. That seems to address this kind of problem. Is there anything else to
look into? Since the front-end is mod_perl, it means I can use a mod_perl
throttling solution, too, which is cool.

I realize there are some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by OK.

Also, does anyone have suggestions for testing once throttling is in place?
I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load. ab to the rescue, I suppose.

Thanks much,

-- 
Bill Moseley
mailto:[EMAIL PROTECTED]
Re: Throttling, once again
Peter,

Storing the last access time, etc. in a cookie won't work for a perl script
that's abusing your site, or pretty much any spider, or even for anyone
browsing without cookies, for that matter.

The hit on the DB is so short and sweet, and happens after the response has
been sent to the user, so they don't notice any delay and the apache child
takes all of five hundredths of a second more to clean up.

Kyle Dawkins
Central Park Software

On Friday 19 April 2002 11:18, Peter Bi wrote:
> If merely the last access time and number of requests within a given time
> interval are needed, I think the fastest way is to record them in a
> cookie, and check them via an access control. Unfortunately, access
> control is called before the content handler, so the idea can't be used
> for CPU or bandwidth throttles. In the latter cases, one has to call
> DB/file/memory for history.
>
> Peter Bi
Re: Throttling, once again
If merely the last access time and number of requests within a given time
interval are needed, I think the fastest way is to record them in a cookie,
and check them via an access control. Unfortunately, access control is
called before the content handler, so the idea can't be used for CPU or
bandwidth throttles. In the latter cases, one has to call DB/file/memory
for history.

Peter Bi
Re: Throttling, once again
Guys,

We also have a problem with evil clients. It's not always spiders... in
fact, more often than not it's some smart-ass with a customised perl script
designed to screen-scrape all our data (usually to get email addresses for
spam purposes).

Our solution, which works pretty well, is to have a LogHandler that checks
the IP address of an incoming request and stores some information in the DB
about that client: when it was last seen, how many requests it's made in
the past n seconds, etc. It means a DB hit on every request, but it's
pretty light, all things considered.

We then have an external process that wakes up every minute or so and
checks the DB for badly-behaved clients. If it finds such clients, we get
email and the IP is written into a file that is read by mod_rewrite, which
sends bad clients to, well, wherever... http://www.microsoft.com is a good
one :-)

It works great. Of course, mod_throttle sounds pretty cool and maybe I'll
test it out on our servers. There are definitely more ways to do this...

Which reminds me, you HAVE to make sure that your apache children are
size-limited and you have a MaxClients setting where MaxClients * SizeLimit
< Free Memory. If you don't, and you get slammed by one of these wankers,
your server will swap, and then you'll lose all the benefits of shared
memory that apache and mod_perl offer us. Check out the thread that was all
over the list about a month ago for more information. Basically, avoid
swapping at ALL costs.

Kyle Dawkins
Central Park Software

On Friday 19 April 2002 08:55, Marc Slagle wrote:
> We never tried mod_throttle, it might be the best solution. Also, one
> thing to keep in mind is that some search engines will come from multiple
> IP addresses/user-agents at once, making them more difficult to stop.
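As a rough illustration of that kind of LogHandler (not Central Park
Software's actual code; the package name, DSN, and table schema are
assumptions), a mod_perl 1.x sketch might look like this:

    package My::ClientLog;   # illustrative only
    use strict;
    use DBI;
    use Apache::Constants qw(OK DECLINED);

    # Assumed schema for the sketch:
    #   CREATE TABLE client_hits (ip VARCHAR(15) PRIMARY KEY,
    #                             hits INTEGER, last_seen INTEGER);
    my $dsn = "dbi:mysql:throttle";   # placeholder DSN

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;

        # In production you would use Apache::DBI so the connection persists
        my $dbh = DBI->connect($dsn, "user", "password",
                               { RaiseError => 0, PrintError => 1 })
            or return DECLINED;

        # Bump the counter for this IP; insert a row the first time we see it
        my $rows = $dbh->do(
            q{UPDATE client_hits SET hits = hits + 1, last_seen = ?
              WHERE ip = ?}, undef, time(), $ip);
        if (!$rows or $rows == 0) {   # DBI returns "0E0" when no rows matched
            $dbh->do(q{INSERT INTO client_hits (ip, hits, last_seen)
                       VALUES (?, 1, ?)}, undef, $ip, time());
        }

        $dbh->disconnect;
        return OK;
    }
    1;

Configured as a PerlLogHandler My::ClientLog, it runs after the response
has been sent, matching the description above; the external sweeper that
reads the table and feeds mod_rewrite would be a separate cron-style job.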
Re: Throttling, once again
When this happened to our clients' servers we ended up trying some of the
mod_perl based solutions. We tried some of the modules that used shared
memory, but the traffic on our site quickly filled our shared memory and
made the module unusable.

After that we tried blocking the agents altogether, and there is example
code in the Eagle book (Apache::BlockAgent) that worked pretty well. You
might be able to place some of that code in front of your CGI, denying the
search engines' agents/IPs access to it while allowing real users in. That
way the search engines can still get static pages.

We never tried mod_throttle, it might be the best solution. Also, one thing
to keep in mind is that some search engines will come from multiple IP
addresses/user-agents at once, making them more difficult to stop.

> Hi,
>
> Wasn't there just a thread on throttling a few weeks ago?
>
> I had a machine hit hard yesterday with a spider that ignored robots.txt.
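The blocking-by-agent idea is simple enough to sketch. The following is not
Apache::BlockAgent itself, just a toy mod_perl 1.x access handler in the
same spirit; the package name and patterns are invented:

    package My::AgentGate;   # toy stand-in, not Apache::BlockAgent
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    # Example patterns only -- build a real list from your access logs
    my @bad_agents = (qr/EmailSiphon/i, qr/WebZIP/i, qr/grabber/i);

    sub handler {
        my $r  = shift;
        my $ua = $r->header_in('User-Agent') || '';

        for my $pat (@bad_agents) {
            if ($ua =~ $pat) {
                $r->log_reason("blocked user-agent: $ua", $r->uri);
                return FORBIDDEN;
            }
        }
        return OK;
    }
    1;

Attaching it with PerlAccessHandler My::AgentGate only on the location that
maps to the expensive CGI scripts keeps static pages reachable, as Marc
suggests.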
RE: Throttling, once again
Hi,

I *HIGHLY* recommend mod_throttle for Apache. It is very configurable. You
can get the software at
http://www.snert.com/Software/mod_throttle/index.shtml .

The best thing about it is the ability to throttle based on bandwidth and
client IP. We had problems with robots as well as malicious end users who
would flood our server with requests. mod_throttle allows you to set up
rules to prevent one IP address from making more than x requests for the
same document in y time period.

Our mod_perl servers, for example, track the last 50 client IPs. If one of
those clients goes above 50 requests, it is blocked out. The last client
that requests a document is put at the top of the list, so even very active
legit users tend to fall off the bottom, but things like robots stay
blocked.

I highly recommend you look into it. We were writing some custom functions
to block this kind of thing, but the Apache module makes it so much nicer.

Jeremy
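To make the "last 50 clients" behaviour concrete, here is a small mod_perl
1.x sketch of the same idea. It is not mod_throttle's code or configuration,
and it keeps state per Apache child, so a real deployment would need a
shared store:

    package My::RecentIPs;   # illustration only; state is per child process
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my $max_tracked = 50;    # how many recent client IPs to remember
    my $max_hits    = 50;    # block an IP once it exceeds this many hits
    my (@order, %hits);      # most recently seen IP is kept at the front

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;

        # Move this IP to the top of the recently-seen list and count the hit
        @order = ($ip, grep { $_ ne $ip } @order);
        $hits{$ip}++;

        # Quiet clients fall off the bottom and their counts are forgotten
        while (@order > $max_tracked) {
            my $old = pop @order;
            delete $hits{$old};
        }

        return FORBIDDEN if $hits{$ip} > $max_hits;
        return OK;
    }
    1;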
Re: Throttling, once again
On Friday 19 April 2002 6:55 am, Bill Moseley wrote:
> Hi,
>
> Wasn't there just a thread on throttling a few weeks ago?
>
> I had a machine hit hard yesterday with a spider that ignored robots.txt.

I thought the standard practice these days was to put some URL at a place
no human would reach, ban that URL via robots.txt, and then automatically
update your routing tables for any IP address that tries to visit it.

Just a thought, there's probably more to it.

Matt.
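A rough sketch of what that trap could look like with mod_perl, assuming an
external job later turns the collected IPs into firewall or routing-table
entries (the trap path, ban-file location, and package name are all
invented):

    # robots.txt, so well-behaved robots never follow the trap:
    #   User-agent: *
    #   Disallow: /trap

    package My::SpiderTrap;   # illustrative name
    use strict;
    use Apache::Constants qw(FORBIDDEN);
    use Fcntl qw(O_WRONLY O_APPEND O_CREAT);

    my $banfile = "/var/run/banned_ips.txt";   # read later by an external job

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;

        # Record the offender; the actual blocking happens out-of-band
        if (sysopen(my $fh, $banfile, O_WRONLY | O_APPEND | O_CREAT)) {
            print $fh "$ip\n";
            close $fh;
        }
        $r->log_reason("spider trap hit", $r->uri);
        return FORBIDDEN;
    }
    1;

Mapped to the trap URL with a <Location /trap> block that sets SetHandler
perl-script and PerlHandler My::SpiderTrap, any client that ignores
robots.txt and fetches /trap identifies itself.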