Re: Throttling, once again

2002-04-25 Thread Randal L. Schwartz

>>>>> "Christian" == Christian Gilmore [EMAIL PROTECTED] writes:

Christian> Hi, Drew.
 I came across the very problem you're having. I use mod_bandwidth; it's
 actively maintained and lets you monitor bandwidth usage by IP, by directory,
 or in any number of other ways: http://www.cohprog.com/mod_bandwidth.html

Christian> The size of the data sent through the pipe doesn't reflect the CPU spent to
Christian> produce that data. mod_bandwidth probably doesn't apply in the current
Christian> scenario being discussed.

 which is why I wrote Stonehenge::Throttle.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



RE: Throttling, once again

2002-04-22 Thread Christian Gilmore

Hi, Jeremy.

 I looked at the page you mentioned below.  It wasn't really
 clear on the page, but what happens when the requests get above
 the max allowed?  Are the remaining requests queued or are they
 simply given some kind of error message?

The service will respond with an HTTP 503 when the MaxConcurrentReqs limit
is reached. That tells the browser that the service is temporarily
unavailable and that it should try again later.
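
For anyone who wants roughly the same behaviour from mod_perl itself, here is
a sketch of the idea (not mod_throttle_access's actual code; the package name,
counter file, and limit below are invented): an access handler that turns away
requests over a concurrency cap with a 503.

    package My::ConcurrencyCap;
    # Sketch only: cap in-flight requests to a protected area and answer
    # 503 Service Unavailable above the limit.
    use strict;
    use Apache::Constants qw(OK);
    use Fcntl qw(:DEFAULT :flock);

    my $MAX  = 10;                        # plays the role of MaxConcurrentReqs
    my $FILE = '/tmp/concurrency.count';  # invented location for the counter

    sub handler {
        my $r = shift;
        if (_bump(+1) > $MAX) {
            _bump(-1);
            $r->log_reason("more than $MAX concurrent requests", $r->uri);
            return 503;                   # browser is told to retry later
        }
        $r->register_cleanup(sub { _bump(-1); OK });  # free the slot afterwards
        return OK;
    }

    sub _bump {                           # adjust the shared counter under flock
        my $delta = shift;
        sysopen my $fh, $FILE, O_RDWR | O_CREAT or return 0;
        flock $fh, LOCK_EX;
        my $n = <$fh>;
        $n = 0 unless defined $n;
        $n += $delta;
        $n = 0 if $n < 0;
        seek $fh, 0, 0;
        truncate $fh, 0;
        print $fh $n;
        close $fh;                        # releases the lock
        return $n;
    }
    1;

It would be wired up with PerlAccessHandler My::ConcurrencyCap inside the
<Location> or <Directory> block being protected.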

 There seem to be a number of different modules for this kind of
 thing, but most of them seem to be fairly old.  We could use a
 more current throttling module that combines what others have
 come up with.

Age shouldn't matter. If something works as designed, it doesn't need to be
updated. :)

 For example, the snert.com mod_throttle is nice because it does
 it based on IP - but it does it site wide in that mode.  This
 mod_throttle seems nice because it can be set for an individual
 URI...But that's a pain for sites like mine that have 50 or
 more intensive scripts (by directory would be nice).  And still,
 neither of these approaches uses cookies like some of the
 others do to make sure that legit proxies aren't blocked.

Well, the design goals of each are probably different. For instance,
mod_throttle_access was designed to keep a service healthy, not punish a set
of over-zealous users. Blocking by IP doesn't necessarily protect the health
of your service. Also, you shouldn't rely on cookies to ensure the health of
your service. If someone has cookies disabled, they can defeat your scheme.

BTW, mod_throttle_access is a per-directory module (i.e., it works by
Directory, Location, or Files), so you can protect an entire tree at once. It
simply treats that entire tree as one unit when counting toward
MaxConcurrentReqs.
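
As a configuration sketch (the path and the limit are made up, and the exact
directive set is whatever the module's documentation says), protecting a whole
tree of scripts would look something like:

    <Location /cgi-bin/reports>
        MaxConcurrentReqs 5
    </Location>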

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group





RE: Throttling, once again

2002-04-22 Thread Christian Gilmore

Hi, Drew.

 I came across the very problem you're having. I use mod_bandwidth; it's
 actively maintained and lets you monitor bandwidth usage by IP, by directory,
 or in any number of other ways: http://www.cohprog.com/mod_bandwidth.html

The size of the data sent through the pipe doesn't reflect the CPU spent to
produce that data. mod_bandwidth probably doesn't apply in the current
scenario being discussed.

Thanks,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group





RE: Throttling, once again

2002-04-19 Thread Jeremy Rusnak

Hi,

I *HIGHLY* recommend mod_throttle for Apache.  It is very
configurable.  You can get the software at
http://www.snert.com/Software/mod_throttle/index.shtml .

The best thing about it is the ability to throttle based
on bandwidth and client IP.  We had problems with robots
as well as malicious end users who would flood our
server with requests.

mod_throttle allows you to set up rules to prevent one
IP address from making more than x requests for the
same document in y time period.  Our mod_perl servers,
for example, track the last 50 client IPs.  If one of
those clients goes above 50 requests, it is blocked
out.  The last client that requests a document is put
at the top of the list, so even very active legit users
tend to fall off the bottom, but things like robots
stay blocked.
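
In plain Perl, the bookkeeping described above comes out roughly like this (a
sketch of the idea only, not mod_throttle's real code; the numbers simply
mirror the description):

    my @recent;                  # most-recent-first list of [ip, hit_count]
    my $MAX_IPS  = 50;           # how many client IPs to remember
    my $MAX_HITS = 50;           # requests allowed while an IP stays listed

    # Returns true if the request may proceed, false if the IP is blocked.
    sub check_ip {
        my $ip = shift;
        my ($entry) = grep { $_->[0] eq $ip } @recent;
        if ($entry) {
            $entry->[1]++;
            @recent = ($entry, grep { $_->[0] ne $ip } @recent);  # move to top
        } else {
            unshift @recent, [ $ip, 1 ];
            pop @recent while @recent > $MAX_IPS;  # quiet clients fall off
        }
        return $recent[0][1] <= $MAX_HITS;
    }

Clients that go quiet drop off the bottom and start fresh next time, while a
robot hammering the same document keeps climbing toward the limit, at which
point check_ip() returns false and the server can answer with a 503 or simply
drop the connection.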

I highly recommend you look into it.  We had been writing some
custom functions to block this kind of thing,
but the Apache module makes it so much nicer.

Jeremy

-Original Message-
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 10:56 PM
To: [EMAIL PROTECTED]
Subject: Throttling, once again


Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.  

Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
 It's a mod_perl server, but has a few CGI scripts that it handles, and the
spider was hitting one of the CGI scripts over and over.  They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl.  And under normal
peak loads RAM is not a problem.  

The machine also has a bandwidth limitation (a packet shaper is used to share
the bandwidth).  That combined with the spider didn't help things.  Luckily
there's 4GB so even at a load average of 90 it wasn't really swapping much.
 (Well not when I caught it, anyway).  This spider was using the same IP
for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago.  That seems to address this kind of problem.  Is there anything else
to look into?  Since the front-end is mod_perl, it means I can use a mod_perl
throttling solution, too, which is cool.

I realize there are some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by ok.

Also, does anyone have suggestions for testing once throttling is in place?
 I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load.  ab to the rescue, I suppose.

Thanks much,


-- 
Bill Moseley
mailto:[EMAIL PROTECTED]




Re: Throttling, once again

2002-04-19 Thread Marc Slagle

When this happened to our clients' servers we ended up trying some of the
mod_perl-based solutions.  We tried some of the modules that used shared
memory, but the traffic on our site quickly filled our shared memory and
made the module unusable.  After that we tried blocking the agents
altogether, and there is example code in the Eagle book (Apache::BlockAgent)
that worked pretty well.

You might be able to place some of that code in your CGI, denying the search
engines' agents/IPs access to it, while allowing real users in.  That
way the search engines can still get static pages.

We never tried mod_throttle, it might be the best solution.  Also, one thing
to keep in mind is that some search engines will come from multiple IP
addresses/user-agents at once, making them more difficult to stop.





Re: Throttling, once again

2002-04-19 Thread kyle dawkins

Guys

We also have a problem with evil clients. It's not always spiders... in fact 
more often than not it's some smart-ass with a customised perl script 
designed to screen-scrape all our data (usually to get email addresses for 
spam purposes).

Our solution, which works pretty well, is to have a LogHandler that checks the 
IP address of an incoming request and stores some information in the DB about 
that client; when it was last seen, how many requests it's made in the past n 
seconds, etc.  It means a DB hit on every request but it's pretty light, all 
things considered.
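
A stripped-down version of such a log-phase handler might look like this (a
sketch under assumptions: the DSN, table, and column names are invented, and a
real setup would want Apache::DBI for persistent connections):

    package My::ClientLog;
    # Record one row per request so a separate job can spot abusive IPs.
    use strict;
    use DBI;
    use Apache::Constants qw(OK);

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;
        my $dbh = DBI->connect('dbi:mysql:weblog', 'user', 'pass',
                               { RaiseError => 0, PrintError => 1 })
            or return OK;              # never fail the request over logging
        $dbh->do('INSERT INTO hits (ip, uri, ts) VALUES (?, ?, NOW())',
                 undef, $ip, $r->uri);
        $dbh->disconnect;
        return OK;
    }
    1;

It would be enabled with PerlLogHandler My::ClientLog in httpd.conf.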

We then have an external process that wakes up every minute or so and checks 
the DB for badly-behaved clients.  If it finds such clients, we get email and 
the IP is written into a file that is read by mod_rewrite, which sends bad 
clients to, well, wherever... http://www.microsoft.com is a good one :-)
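
The sweeper itself could be a small cron job along these lines (again a
sketch: the thresholds, file locations, and schema are assumptions, and the
query is written for MySQL):

    #!/usr/bin/perl
    # Runs every minute: find IPs with too many hits in the last 60 seconds
    # and append them to the map file that mod_rewrite consults.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:weblog', 'user', 'pass',
                           { RaiseError => 1 });
    my $bad = $dbh->selectcol_arrayref(
        'SELECT ip FROM hits
          WHERE ts > NOW() - INTERVAL 60 SECOND
          GROUP BY ip HAVING COUNT(*) > 300');
    if (@$bad) {
        open my $fh, '>>', '/etc/apache/badguys.map' or die "append: $!";
        print $fh "$_ deny\n" for @$bad;   # key/value lines for a txt RewriteMap
        close $fh;
        # ...and mail the list to the admins here, as described above.
    }

The resulting file can then be consulted with a txt: RewriteMap and a
RewriteCond/RewriteRule pair that redirects any client whose IP looks up as
deny.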

It works great.  Of course, mod_throttle sounds pretty cool and maybe I'll 
test it out on our servers.  There are definitely more ways to do this...

Which reminds me, you HAVE to make sure that your apache children are 
size-limited and you have a MaxClients setting where MaxClients * SizeLimit <
Free Memory.  If you don't, and you get slammed by one of these wankers, your 
server will swap and then you'll lose all the benefits of shared memory that 
apache and mod_perl offer us.  Check the thread out that was all over the 
list about a  month ago for more information.  Basically, avoid swapping at 
ALL costs.
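
As a concrete, made-up illustration of that rule of thumb: with roughly 1.5 GB
free for Apache and each size-limited child allowed to grow to 30 MB,
MaxClients should stay below about 50. Under mod_perl 1.x the per-child cap is
usually enforced with Apache::SizeLimit, along these lines (the numbers are
only examples; check its docs for the handler phase it wants):

    # in startup.pl
    use Apache::SizeLimit;
    $Apache::SizeLimit::MAX_PROCESS_SIZE = 30000;   # in KB, i.e. ~30 MB per child

    # in httpd.conf
    #   MaxClients 50
    #   PerlCleanupHandler Apache::SizeLimit   # or the phase the docs specify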


Kyle Dawkins
Central Park Software

On Friday 19 April 2002 08:55, Marc Slagle wrote:
 We never tried mod_throttle, it might be the best solution.  Also, one
 thing to keep in mind is that some search engines will come from multiple
 IP addresses/user-agents at once, making them more difficult to stop.




Re: Throttling, once again

2002-04-19 Thread Peter Bi

If merely the last access time and number of requests within a given time
interval are needed, I think the fastest way is to record them in a cookie
and check them via an access control handler. Unfortunately, access control is
called before the content handler, so the idea can't be used for CPU or
bandwidth throttles. In the latter cases, one has to consult a DB, file, or
shared memory for the history.
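
For what it's worth, that cookie check could be sketched as an access handler
like the following (the cookie name, the 60-second window, and the 100-request
limit are all invented, and as noted it only restrains clients that accept
cookies):

    package My::CookieCount;
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    sub handler {
        my $r = shift;
        return OK unless $r->is_initial_req;
        my %jar;
        for (split /;\s*/, ($r->header_in('Cookie') || '')) {
            my ($k, $v) = split /=/, $_, 2;
            $jar{$k} = $v if defined $v;
        }
        my ($start, $count) = split /,/, ($jar{hits} || '0,0');
        my $now = time;
        ($start, $count) = ($now, 0) if $now - $start > 60;   # new window
        $count++;
        # err_header_out so the cookie is set even on the error response
        $r->err_header_out('Set-Cookie' => "hits=$start,$count; path=/");
        return $count > 100 ? FORBIDDEN : OK;
    }
    1;

It would be registered with PerlAccessHandler My::CookieCount.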

Peter Bi








RE: Throttling, once again

2002-04-19 Thread Christian Gilmore

Bill,

If you're looking to throttle access to a particular URI (or set of URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group






RE: Throttling, once again

2002-04-19 Thread Jeremy Rusnak

Hi,

I looked at the page you mentioned below.  It wasn't really
clear on the page, but what happens when the requests get above
the max allowed?  Are the remaining requests queued or are they
simply given some kind of error message?

There seem to be a number of different modules for this kind of
thing, but most of them seem to be fairly old.  We could use a
more current throttling module that combines what others have
come up with.  

For example, the snert.com mod_throttle is nice because it does
it based on IP - but it does it site wide in that mode.  This
mod_throttle seems nice because it can be set for an individual
URI...But that's a pain for sites like mine that have 50 or
more intensive scripts (by directory would be nice).  And still,
neither of these approaches uses cookies like some of the
others do to make sure that legit proxies aren't blocked.

Jeremy

-Original Message-
From: Christian Gilmore [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 8:31 AM
To: 'Bill Moseley'; [EMAIL PROTECTED]
Subject: RE: Throttling, once again


Bill,

If you're looking to throttle access to a particular URI (or set of URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group






RE: Throttling, once again

2002-04-19 Thread Drew Wymore

I came across the very problem you're having. I use mod_bandwidth; it's
actively maintained and lets you monitor bandwidth usage by IP, by directory,
or in any number of other ways: http://www.cohprog.com/mod_bandwidth.html

Although it's not mod_perl related, I hope that this helps.
Drew




Re: Throttling, once again

2002-04-19 Thread Peter Bi

How about adding an MD5 watermark to the cookie? Well, it is becoming
complicated...
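
One way to read that suggestion (a sketch; the secret and the cookie layout
are made up) is to append an MD5 digest keyed with a server-side secret, so a
client cannot simply edit its own counter:

    use Digest::MD5 qw(md5_hex);

    my $secret = 'server-side-secret-here';   # never sent to the client

    sub make_cookie_value {                   # e.g. "1019241600,12,<digest>"
        my ($start, $count) = @_;
        my $data = "$start,$count";
        return $data . ',' . md5_hex($data . $secret);
    }

    sub check_cookie_value {                  # returns () if tampered with
        my ($value) = @_;
        my ($start, $count, $mac) = split /,/, $value;
        return unless defined $mac
            and $mac eq md5_hex("$start,$count" . $secret);
        return ($start, $count);
    }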

Peter Bi

- Original Message -
From: kyle dawkins [EMAIL PROTECTED]
To: Peter Bi [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Friday, April 19, 2002 8:29 AM
Subject: Re: Throttling, once again


 Peter

 Storing the last access time, etc in a cookie won't work for a perl script
 that's abusing your site, or pretty much any spider, or even for anyone
 browsing without cookies, for that matter.

 The hit on the DB is so short and sweet and happens after the response has
 been sent to the user, so they don't notice any delay, and the apache child
 takes all of five hundredths of a second more to clean up.

 Kyle Dawkins
 Central Park Software





Re: Throttling, once again

2002-04-18 Thread Matt Sergeant

On Friday 19 April 2002 6:55 am, Bill Moseley wrote:
 Hi,

 Wasn't there just a thread on throttling a few weeks ago?

 I had a machine hit hard yesterday with a spider that ignored robots.txt.

I thought the standard practice these days was to put some URL at an
un-reachable place (by a human), for example using something like
<a href="..."></a>. And then ban that via robots.txt. And then automatically
update your routing tables for any IP addresses that try and visit that URL.

Just a thought, there's probably more to it.

Matt.
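
A bare-bones version of that trap could be a small mod_perl handler sitting
behind the hidden, robots.txt-banned link. Everything below (the file name,
package, and log message) is hypothetical, and a real setup would feed the
collected IPs into the routing tables or firewall as Matt suggests:

    package My::SpiderTrap;
    use strict;
    use Apache::Constants qw(NOT_FOUND);

    sub handler {
        my $r  = shift;
        my $ip = $r->connection->remote_ip;
        if (open my $fh, '>>', '/var/tmp/trap-ips.txt') {
            print $fh "$ip\n";       # a cron job can turn these into
            close $fh;               # routing-table or firewall entries
        }
        $r->log_error("robots.txt trap sprung by $ip");
        return NOT_FOUND;            # give the crawler nothing useful
    }
    1;

It would be mapped to the trap URL with a <Location> block containing
SetHandler perl-script and PerlHandler My::SpiderTrap, while robots.txt
carries a matching Disallow line so well-behaved crawlers never follow it.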