Re: Throttling, once again

2002-04-25 Thread Randal L. Schwartz

> "Christian" == Christian Gilmore <[EMAIL PROTECTED]> writes:

Christian> Hi, Drew.
>> I came across the very problem you're having. I use mod_bandwidth, its
>> actively maintained, allows via IP, directory or any number of ways to
>> monitor bandwidth usage http://www.cohprog.com/mod_bandwidth.html

Christian> The size of the data sent through the pipe doesn't reflect the CPU spent to
Christian> produce that data. mod_bandwidth probably doesn't apply in the current
Christian> scenario being discussed.

Which is why I wrote Stonehenge::Throttle.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[EMAIL PROTECTED]> <http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



RE: Throttling, once again

2002-04-22 Thread Christian Gilmore

Hi, Drew.

> I came across the very problem you're having. I use mod_bandwidth, its
> actively maintained, allows via IP, directory or any number of ways to
> monitor bandwidth usage http://www.cohprog.com/mod_bandwidth.html

The size of the data sent through the pipe doesn't reflect the CPU spent to
produce that data. mod_bandwidth probably doesn't apply in the current
scenario being discussed.

Thanks,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group





RE: Throttling, once again

2002-04-22 Thread Christian Gilmore

Hi, Jeremy.

> I looked at the page you mentioned below.  It wasn't really
> clear on the page, but what happens when the requests get above
> the max allowed?  Are the remaining requests queued or are they
> simply given some kind of error message?

The service will respond with an HTTP 503 (Service Unavailable) when the
MaxConcurrentReqs limit is reached. That tells the client that the service
is temporarily unavailable and that it should try again later.

> There seem to be a number of different modules for this kind of
> thing, but most of them seem to be fairly old.  We could use a
> more currently throttling module that combines what others have
> come up with.

Age shouldn't matter. If something works as designed, it doesn't need to be
updated. :)

> For example, the snert.com mod_throttle is nice because it does
> it based on IP - but it does it site wide in that mode.  This
> mod_throttle seems nice because it can be set for an individual
> URI...But that's a pain for sites like mine that have 50 or
> more intensive scripts (by directory would be nice).  And still
> both of these approaches don't use cookies like some of the
> others to make sure that legit proxies aren't blocked.

Well, the design goals of each are probably different. For instance,
mod_throttle_access was designed to keep a service healthy, not punish a set
of over-zealous users. Blocking by IP doesn't necessarily protect the health
of your service. Also, you shouldn't rely on cookies to ensure the health of
your service. If someone has cookies disabled, they can defeat your scheme.

BTW, mod_throttle_access is a per-directory module (ie, by <Directory>,
<Location>, or <Files>), so you can protect an entire tree at once. It will
just count that entire tree as one unit during its count toward
MaxConcurrentReqs.
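For readers unfamiliar with the module, the configuration looks roughly like
this. A sketch only: MaxConcurrentReqs is the directive discussed above, but
the container placement and the path are assumptions, so check the module's
own documentation.

```apache
# Hypothetical httpd.conf fragment: cap concurrent requests for an
# entire tree. The <Location> path and the limit of 10 are invented
# examples; requests beyond the limit get the 503 described above.
<Location /reports>
    MaxConcurrentReqs 10
</Location>
```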

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group





Re: Throttling, once again

2002-04-21 Thread Steve Piner


You're assuming that the spider respects cookies. I would not expect
this to be the case.


Peter Bi wrote:
> 
> How about adding a MD5 watermark for the cookie ? Well, it is becoming
> complicated 
> 
> Peter Bi
> 
> - Original Message -
> From: "kyle dawkins" <[EMAIL PROTECTED]>
> To: "Peter Bi" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Friday, April 19, 2002 8:29 AM
> Subject: Re: Throttling, once again
> 
> > Peter
> >
> > Storing the last access time, etc in a cookie won't work for a perl script
> > that's abusing your site, or pretty much any spider, or even for anyone
> > browsing without cookies, for that matter.

-- 
Steve Piner
Web Applications Developer
Marketview Limited
http://www.marketview.co.nz



Re: Throttling, once again

2002-04-19 Thread Peter Bi

How about adding an MD5 watermark to the cookie? Well, it is becoming
complicated ...

Peter Bi

- Original Message -
From: "kyle dawkins" <[EMAIL PROTECTED]>
To: "Peter Bi" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 8:29 AM
Subject: Re: Throttling, once again


> Peter
>
> Storing the last access time, etc in a cookie won't work for a perl script
> that's abusing your site, or pretty much any spider, or even for anyone
> browsing without cookies, for that matter.
>
> The hit on the DB is so short and sweet and happens after the response has
> been sent to the user so they don't notice any delay and the apache child
> takes all of five hundredths of a second more to clean up.
>
> Kyle Dawkins
> Central Park Software
>
> On Friday 19 April 2002 11:18, Peter Bi wrote:
> > If merely the last access time and number of requests within a given time
> > interval are needed, I think the fastest way is to record them in a cookie,
> > and check them via an access control. Unfortunately, access control is
> > called before content handler, so the idea can't be used for CPU or
> > bandwidth throttles. In the later cases, one has to call DB/file/memory for
> > history.
> >
> > Peter Bi
> >
> >
> > - Original Message -
> > From: "kyle dawkins" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Friday, April 19, 2002 8:02 AM
> > Subject: Re: Throttling, once again
> >
> > > Guys
> > >
> > > We also have a problem with evil clients. It's not always spiders... in
> > > fact more often than not it's some smart-ass with a customised perl
> > > script designed to screen-scrape all our data (usually to get email
> > > addresses for spam purposes).
> > >
> > > Our solution, which works pretty well, is to have a LogHandler that
> > > checks the IP address of an incoming request and stores some information
> > > in the DB about that client; when it was last seen, how many requests
> > > it's made in the past n seconds, etc.  It means a DB hit on every
> > > request but it's pretty light, all things considered.
> > >
> > > We then have an external process that wakes up every minute or so and
> > > checks the DB for badly-behaved clients.  If it finds such clients, we
> > > get email and the IP is written into a file that is read by mod_rewrite,
> > > which sends bad clients to, well, wherever... http://www.microsoft.com
> > > is a good one :-)
> > >
> > > It works great.  Of course, mod_throttle sounds pretty cool and maybe
> > > I'll test it out on our servers.  There are definitely more ways to do
> > > this...
> > >
> > > Which reminds me, you HAVE to make sure that your apache children are
> > > size-limited and you have a MaxClients setting where MaxClients *
> > > SizeLimit < Free Memory.  If you don't, and you get slammed by one of
> > > these wankers, your server will swap and then you'll lose all the
> > > benefits of shared memory that apache and mod_perl offer us.  Check the
> > > thread out that was all over the list about a month ago for more
> > > information.  Basically, avoid swapping at ALL costs.
> > >
> > >
> > > Kyle Dawkins
> > > Central Park Software
> > >
> > > On Friday 19 April 2002 08:55, Marc Slagle wrote:
> > > > We never tried mod_throttle, it might be the best solution.  Also, one
> > > > thing to keep in mind is that some search engines will come from
> > > > multiple IP addresses/user-agents at once, making them more difficult
> > > > to stop.
>
>




RE: Throttling, once again

2002-04-19 Thread Drew Wymore

I came across the very problem you're having. I use mod_bandwidth; it's
actively maintained and lets you monitor bandwidth usage by IP, by
directory, or in any number of other ways:
http://www.cohprog.com/mod_bandwidth.html

Although it's not mod_perl related, I hope that this helps.
Drew
-Original Message-
From: Jeremy Rusnak [mailto:[EMAIL PROTECTED]] 
Sent: Friday, April 19, 2002 12:06 PM
To: Christian Gilmore; [EMAIL PROTECTED]
Subject: RE: Throttling, once again

Hi,

I looked at the page you mentioned below.  It wasn't really
clear on the page, but what happens when the requests get above
the max allowed?  Are the remaining requests queued or are they
simply given some kind of error message?

There seem to be a number of different modules for this kind of
thing, but most of them seem to be fairly old.  We could use a
more current throttling module that combines what others have
come up with.  

For example, the snert.com mod_throttle is nice because it does
it based on IP - but it does it site wide in that mode.  This
mod_throttle seems nice because it can be set for an individual
URI...But that's a pain for sites like mine that have 50 or
more intensive scripts (by directory would be nice).  And still
both of these approaches don't use cookies like some of the
others to make sure that legit proxies aren't blocked.

Jeremy

-Original Message-
From: Christian Gilmore [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 8:31 AM
To: 'Bill Moseley'; [EMAIL PROTECTED]
Subject: RE: Throttling, once again


Bill,

If you're looking to throttle access to a particular URI (or set of
URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group


-Original Message-
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 12:56 AM
To: [EMAIL PROTECTED]
Subject: Throttling, once again


Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.

Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
 It's a mod_perl server, but has a few CGI scripts that it handles, and the
spider was hitting one of the CGI scripts over and over.  They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl.  And under
normal peak loads RAM is not a problem.

The machine also has bandwidth limitation (packet shaper is used to share
the bandwidth).  That combined with the spider didn't help things.  Luckily
there's 4GB so even at a load average of 90 it wasn't really swapping much.
 (Well not when I caught it, anyway).  This spider was using the same IP
for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago.  That seems to address this kind of problem.  Is there anything else
to look into?  Since the front-end is mod_perl, it means I can use a
mod_perl throttling solution, too, which is cool.

I realize there's some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by ok.

Also, does anyone have suggestions for testing once throttling is in place?
 I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load.  ab to the rescue, I suppose.

Thanks much,


--
Bill Moseley
mailto:[EMAIL PROTECTED]




RE: Throttling, once again

2002-04-19 Thread Jeremy Rusnak

Hi,

I looked at the page you mentioned below.  It wasn't really
clear on the page, but what happens when the requests get above
the max allowed?  Are the remaining requests queued or are they
simply given some kind of error message?

There seem to be a number of different modules for this kind of
thing, but most of them seem to be fairly old.  We could use a
more current throttling module that combines what others have
come up with.  

For example, the snert.com mod_throttle is nice because it does
it based on IP - but it does it site wide in that mode.  This
mod_throttle seems nice because it can be set for an individual
URI...But that's a pain for sites like mine that have 50 or
more intensive scripts (by directory would be nice).  And still
both of these approaches don't use cookies like some of the
others to make sure that legit proxies aren't blocked.

Jeremy

-Original Message-
From: Christian Gilmore [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 8:31 AM
To: 'Bill Moseley'; [EMAIL PROTECTED]
Subject: RE: Throttling, once again


Bill,

If you're looking to throttle access to a particular URI (or set of URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group


-Original Message-
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 12:56 AM
To: [EMAIL PROTECTED]
Subject: Throttling, once again


Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.

Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
 It's a mod_perl server, but has a few CGI scripts that it handles, and the
spider was hitting one of the CGI scripts over and over.  They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl.  And under normal
peak loads RAM is not a problem.

The machine also has bandwidth limitation (packet shaper is used to share
the bandwidth).  That combined with the spider didn't help things.  Luckily
there's 4GB so even at a load average of 90 it wasn't really swapping much.
 (Well not when I caught it, anyway).  This spider was using the same IP
for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago.  That seems to address this kind of problem.  Is there anything else
to look into?  Since the front-end is mod_perl, it means I can use a mod_perl
throttling solution, too, which is cool.

I realize there's some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by ok.

Also, does anyone have suggestions for testing once throttling is in place?
 I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load.  ab to the rescue, I suppose.

Thanks much,


--
Bill Moseley
mailto:[EMAIL PROTECTED]




RE: Throttling, once again

2002-04-19 Thread Wim Kerkhoff

On 19-Apr-2002 Bill Moseley wrote:

> Also, does anyone have suggestions for testing once throttling is in place?
>  I don't want to start cutting off the good customers, but I do want to get
> an idea how it acts under load.  ab to the rescue, I suppose.

wget supports recursive spidering. Or try IE or another browser that will save
an entire site to disk. The scripts that the cherry pickers use to garner email
addresses are probably all over the net too...

for i in `seq 1 10`; do 
 cd /tmp
 mkdir wget$i
 cd wget$i 
 wget -rb http://localhost/wiki
done

Load on my server immediately jumped from < 1 to 12 when I did this.

Wim



Re: Throttling, once again

2002-04-19 Thread Perrin Harkins

Hi Bill,

> Wasn't there just a thread on throttling a few weeks ago?

There have been many.  Here's my answer to one of them: 
http://mathforum.org/epigone/modperl/blexblolgang/004101c0f2cc$9d14a540$[EMAIL PROTECTED]

> Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
> ago.

What we did was take that module, replace the CPU monitoring stuff with 
simple hit counting, and make it just return an "Access Denied" when the 
limit is  hit.  We also did some slightly fancier stuff than just 
blocking by IP, since we were concerned about proxy servers getting 
blocked.  You may not need to bother with that though.
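As an illustration only (this is not the actual code Perrin describes), a
minimal mod_perl 1.x access handler along those lines might look like the
following. The package name, limits, and file layout are all invented for
the sketch; it keeps one timestamp file per IP on disk, which is what makes
the NFS-sharing mentioned below possible.

```perl
package My::Throttle;
use strict;
use Apache::Constants qw(OK);

my $MAX_HITS = 30;                # invented limit: requests allowed...
my $WINDOW   = 60;                # ...per this many seconds, per IP
my $DIR      = '/var/throttle';   # plain files, so an NFS share works

sub handler {
    my $r = shift;
    return OK unless $r->is_initial_req;
    my $file = "$DIR/" . $r->connection->remote_ip;

    # Append this hit's timestamp, then count the recent ones.
    open my $fh, '>>', $file or return OK;   # fail open, never lock users out
    print $fh time(), "\n";
    close $fh;

    open $fh, '<', $file or return OK;
    my $cutoff = time() - $WINDOW;
    my $recent = grep { $_ > $cutoff } <$fh>;
    close $fh;

    return $recent > $MAX_HITS ? 503 : OK;   # 503 = Service Unavailable
}
1;
```

Installed with something like `PerlAccessHandler My::Throttle`. Note the
sketch never prunes old entries; a real version would truncate the per-IP
files periodically.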

One nice thing about Randal's module is that it uses disk (not shared 
memory) and worked well across a cluster sharing the directory over NFS.

> Also, does anyone have suggestions for testing once throttling is in place?
>  I don't want to start cutting off the good customers, but I do want to get
> an idea how it acts under load.  ab to the rescue, I suppose.

That will work, as will http_load or httperf.

- Perrin




RE: Throttling, once again

2002-04-19 Thread Christian Gilmore

Bill,

If you're looking to throttle access to a particular URI (or set of URIs),
give mod_throttle_access a look. It is available via the Apache Module
Registry and at http://www.fremen.org/apache/mod_throttle_access.html .

Regards,
Christian

-
Christian Gilmore
Technology Leader
GeT WW Global Applications Development
IBM Software Group


-Original Message-
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 12:56 AM
To: [EMAIL PROTECTED]
Subject: Throttling, once again


Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.

Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
 It's a mod_perl server, but has a few CGI scripts that it handles, and the
spider was hitting one of the CGI scripts over and over.  They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl.  And under normal
peak loads RAM is not a problem.

The machine also has bandwidth limitation (packet shaper is used to share
the bandwidth).  That combined with the spider didn't help things.  Luckily
there's 4GB so even at a load average of 90 it wasn't really swapping much.
 (Well not when I caught it, anyway).  This spider was using the same IP
for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago.  That seems to address this kind of problem.  Is there anything else
to look into?  Since the front-end is mod_perl, it means I can use a mod_perl
throttling solution, too, which is cool.

I realize there's some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by ok.

Also, does anyone have suggestions for testing once throttling is in place?
 I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load.  ab to the rescue, I suppose.

Thanks much,


--
Bill Moseley
mailto:[EMAIL PROTECTED]




Re: Throttling, once again

2002-04-19 Thread kyle dawkins

Peter

Storing the last access time, etc in a cookie won't work for a perl script 
that's abusing your site, or pretty much any spider, or even for anyone 
browsing without cookies, for that matter.

The hit on the DB is so short and sweet and happens after the response has 
been sent to the user so they don't notice any delay and the apache child 
takes all of five hundredths of a second more to clean up.

Kyle Dawkins
Central Park Software

On Friday 19 April 2002 11:18, Peter Bi wrote:
> If merely the last access time and number of requests within a given time
> interval are needed, I think the fastest way is to record them in a cookie,
> and check them via an access control. Unfortunately, access control is
> called before content handler, so the idea can't be used for CPU or
> bandwidth throttles. In the later cases, one has to call DB/file/memory for
> history.
>
> Peter Bi
>
>
> - Original Message -
> From: "kyle dawkins" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, April 19, 2002 8:02 AM
> Subject: Re: Throttling, once again
>
> > Guys
> >
> > We also have a problem with evil clients. It's not always spiders... in
> > fact more often than not it's some smart-ass with a customised perl script
> > designed to screen-scrape all our data (usually to get email addresses
> > for spam purposes).
> >
> > Our solution, which works pretty well, is to have a LogHandler that checks
> > the IP address of an incoming request and stores some information in the
> > DB about that client; when it was last seen, how many requests it's made
> > in the past n seconds, etc.  It means a DB hit on every request but it's
> > pretty light, all things considered.
> >
> > We then have an external process that wakes up every minute or so and
> > checks the DB for badly-behaved clients.  If it finds such clients, we get
> > email and the IP is written into a file that is read by mod_rewrite, which
> > sends bad clients to, well, wherever... http://www.microsoft.com is a good
> > one :-)
> >
> > It works great.  Of course, mod_throttle sounds pretty cool and maybe
> > I'll test it out on our servers.  There are definitely more ways to do
> > this...
> >
> > Which reminds me, you HAVE to make sure that your apache children are
> > size-limited and you have a MaxClients setting where MaxClients *
> > SizeLimit < Free Memory.  If you don't, and you get slammed by one of
> > these wankers, your server will swap and then you'll lose all the benefits
> > of shared memory that apache and mod_perl offer us.  Check the thread out
> > that was all over the list about a month ago for more information.
> > Basically, avoid swapping at ALL costs.
> >
> >
> > Kyle Dawkins
> > Central Park Software
> >
> > On Friday 19 April 2002 08:55, Marc Slagle wrote:
> > > We never tried mod_throttle, it might be the best solution.  Also, one
> > > thing to keep in mind is that some search engines will come from
> > > multiple IP addresses/user-agents at once, making them more difficult
> > > to stop.




Re: Throttling, once again

2002-04-19 Thread Peter Bi

If merely the last access time and number of requests within a given time
interval are needed, I think the fastest way is to record them in a cookie,
and check them via an access control. Unfortunately, access control is
called before the content handler, so the idea can't be used for CPU or
bandwidth throttles. In the latter cases, one has to consult DB/file/memory
for history.

Peter Bi


- Original Message -
From: "kyle dawkins" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 8:02 AM
Subject: Re: Throttling, once again


> Guys
>
> We also have a problem with evil clients. It's not always spiders... in
> fact more often than not it's some smart-ass with a customised perl script
> designed to screen-scrape all our data (usually to get email addresses for
> spam purposes).
>
> Our solution, which works pretty well, is to have a LogHandler that checks
> the IP address of an incoming request and stores some information in the
> DB about that client; when it was last seen, how many requests it's made in
> the past n seconds, etc.  It means a DB hit on every request but it's
> pretty light, all things considered.
>
> We then have an external process that wakes up every minute or so and
> checks the DB for badly-behaved clients.  If it finds such clients, we get
> email and the IP is written into a file that is read by mod_rewrite, which
> sends bad clients to, well, wherever... http://www.microsoft.com is a good
> one :-)
>
> It works great.  Of course, mod_throttle sounds pretty cool and maybe I'll
> test it out on our servers.  There are definitely more ways to do this...
>
> Which reminds me, you HAVE to make sure that your apache children are
> size-limited and you have a MaxClients setting where MaxClients * SizeLimit
> < Free Memory.  If you don't, and you get slammed by one of these wankers,
> your server will swap and then you'll lose all the benefits of shared
> memory that apache and mod_perl offer us.  Check the thread out that was
> all over the list about a month ago for more information.  Basically,
> avoid swapping at ALL costs.
>
>
> Kyle Dawkins
> Central Park Software
>
> On Friday 19 April 2002 08:55, Marc Slagle wrote:
> > We never tried mod_throttle, it might be the best solution.  Also, one
> > thing to keep in mind is that some search engines will come from multiple
> > IP addresses/user-agents at once, making them more difficult to stop.
>
>




Re: Throttling, once again

2002-04-19 Thread kyle dawkins

Guys

We also have a problem with evil clients. It's not always spiders... in fact 
more often than not it's some smart-ass with a customised perl script 
designed to screen-scrape all our data (usually to get email addresses for 
spam purposes).

Our solution, which works pretty well, is to have a LogHandler that checks the 
IP address of an incoming request and stores some information in the DB about 
that client; when it was last seen, how many requests it's made in the past n 
seconds, etc.  It means a DB hit on every request but it's pretty light, all 
things considered.

We then have an external process that wakes up every minute or so and checks 
the DB for badly-behaved clients.  If it finds such clients, we get email and 
the IP is written into a file that is read by mod_rewrite, which sends bad 
clients to, well, wherever... http://www.microsoft.com is a good one :-)
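A sketch of that logging side (not Kyle's actual code; the DSN, table, and
column names are invented for illustration):

```perl
package My::HitLogger;
use strict;
use DBI;
use Apache::Constants qw(OK DECLINED);

sub handler {
    my $r = shift;
    # Hypothetical DSN and credentials; a real handler would also cache
    # the connection (e.g. via Apache::DBI) instead of reconnecting.
    my $dbh = DBI->connect('dbi:mysql:hits', 'user', 'secret',
                           { RaiseError => 0, PrintError => 0 })
        or return DECLINED;    # never fail a request over logging
    # One row per request: client IP and a timestamp.
    $dbh->do('INSERT INTO hit_log (ip, ts) VALUES (?, ?)',
             undef, $r->connection->remote_ip, time());
    $dbh->disconnect;
    return OK;
}
1;
```

Installed as a `PerlLogHandler`, it runs in the logging phase, after the
response has gone out. The external sweep can then be a cron job doing
something like `SELECT ip, COUNT(*) FROM hit_log WHERE ts > ? GROUP BY ip
HAVING COUNT(*) > ?` and appending offenders to the mod_rewrite map file.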

It works great.  Of course, mod_throttle sounds pretty cool and maybe I'll 
test it out on our servers.  There are definitely more ways to do this...

Which reminds me, you HAVE to make sure that your apache children are 
size-limited and you have a MaxClients setting where MaxClients * SizeLimit < 
Free Memory.  If you don't, and you get slammed by one of these wankers, your 
server will swap and then you'll lose all the benefits of shared memory that 
apache and mod_perl offer us.  Check the thread out that was all over the 
list about a  month ago for more information.  Basically, avoid swapping at 
ALL costs.
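One concrete way to get that size limit is Apache::SizeLimit (whether Kyle
uses this particular module is an assumption, and the numbers are examples
only):

```perl
# In startup.pl: have mod_perl kill off any child that grows past
# ~12MB. With MaxClients 40 in httpd.conf, the worst case is ~480MB,
# which must stay comfortably under the free memory on the box.
use Apache::SizeLimit;
$Apache::SizeLimit::MAX_PROCESS_SIZE = 12_000;   # size in KB

# And in httpd.conf:
#   MaxClients 40
#   PerlFixupHandler Apache::SizeLimit
1;
```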


Kyle Dawkins
Central Park Software

On Friday 19 April 2002 08:55, Marc Slagle wrote:
> We never tried mod_throttle, it might be the best solution.  Also, one
> thing to keep in mind is that some search engines will come from multiple
> IP addresses/user-agents at once, making them more difficult to stop.




Re: Throttling, once again

2002-04-19 Thread Marc Slagle

When this happened to our clients servers we ended up trying some of the
mod_perl based solutions.  We tried some of the modules that used shared
memory, but the traffic on our site quickly filled our shared memory and
made the module unuseable.  After that we tried blocking the agents
altogether, and there is example code in the Eagle book (Apache::BlockAgent)
that worked pretty well.

You might be able to place some of that code in your CGI, denying the search
engines agents/IPs from accessing it, while allowing real users in.  That
way the search engines can still get static pages.

We never tried mod_throttle, it might be the best solution.  Also, one thing
to keep in mind is that some search engines will come from multiple IP
addresses/user-agents at once, making them more difficult to stop.

> Hi,
>
> Wasn't there just a thread on throttling a few weeks ago?
>
> I had a machine hit hard yesterday with a spider that ignored robots.txt.
>
> Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
>  It's a mod_perl server, but has a few CGI scripts that it handles, and the
> spider was hitting one of the CGI scripts over and over.  They were valid
> requests, but coming in faster than they were going out.
>
> Under normal usage the CGI scripts are only accessed a few times a day, so
> it's not much of a problem to have them served by mod_perl.  And under
> normal peak loads RAM is not a problem.
>
> The machine also has bandwidth limitation (packet shaper is used to share
> the bandwidth).  That combined with the spider didn't help things.  Luckily
> there's 4GB so even at a load average of 90 it wasn't really swapping much.
>  (Well not when I caught it, anyway).  This spider was using the same IP
> for all requests.
>
> Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
> ago.  That seems to address this kind of problem.  Is there anything else
> to look into?  Since the front-end is mod_perl, it means I can use a
> mod_perl throttling solution, too, which is cool.
>
> I realize there's some fundamental hardware issues to solve, but if I can
> just keep the spiders from flooding the machine then the machine is getting
> by ok.
>
> Also, does anyone have suggestions for testing once throttling is in place?
>  I don't want to start cutting off the good customers, but I do want to get
> an idea how it acts under load.  ab to the rescue, I suppose.
>
> Thanks much,
>
>
> --
> Bill Moseley
> mailto:[EMAIL PROTECTED]
>




RE: Throttling, once again

2002-04-18 Thread Jeremy Rusnak

Hi,

I *HIGHLY* recommend mod_throttle for Apache.  It is very
configurable.  You can get the software at
http://www.snert.com/Software/mod_throttle/index.shtml .

The best thing about it is the ability to throttle based
on bandwidth and client IP.  We had problems with robots
as well as malicious end users who would flood our
server with requests.

mod_throttle allows you to set up rules to prevent one
IP address from making more than x requests for the
same document in y time period.  Our mod_perl servers,
for example, track the last 50 client IPs.  If one of
those clients goes above 50 requests, it is blocked
out.  The last client that requests a document is put
at the top of the list, so even very active legit users
tend to fall off the bottom, but things like robots
stay blocked.

I highly recommend you look into it.  We had written some
custom functions to block this kind of thing, but the
Apache module makes it so much nicer.

Jeremy

-Original Message-
From: Bill Moseley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, April 18, 2002 10:56 PM
To: [EMAIL PROTECTED]
Subject: Throttling, once again


Hi,

Wasn't there just a thread on throttling a few weeks ago?

I had a machine hit hard yesterday with a spider that ignored robots.txt.  

Load average was over 90 on a dual CPU Enterprise 3500 running Solaris 2.6.
 It's a mod_perl server, but has a few CGI scripts that it handles, and the
spider was hitting one of the CGI scripts over and over.  They were valid
requests, but coming in faster than they were going out.

Under normal usage the CGI scripts are only accessed a few times a day, so
it's not much of a problem to have them served by mod_perl.  And under normal
peak loads RAM is not a problem.  

The machine also has bandwidth limitation (packet shaper is used to share
the bandwidth).  That combined with the spider didn't help things.  Luckily
there's 4GB so even at a load average of 90 it wasn't really swapping much.
 (Well not when I caught it, anyway).  This spider was using the same IP
for all requests.

Anyway, I remember Randal's Stonehenge::Throttle discussed not too long
ago.  That seems to address this kind of problem.  Is there anything else
to look into?  Since the front-end is mod_perl, it means I can use a mod_perl
throttling solution, too, which is cool.

I realize there's some fundamental hardware issues to solve, but if I can
just keep the spiders from flooding the machine then the machine is getting
by ok.

Also, does anyone have suggestions for testing once throttling is in place?
 I don't want to start cutting off the good customers, but I do want to get
an idea how it acts under load.  ab to the rescue, I suppose.

Thanks much,


-- 
Bill Moseley
mailto:[EMAIL PROTECTED]




Re: Throttling, once again

2002-04-18 Thread Matt Sergeant

On Friday 19 April 2002 6:55 am, Bill Moseley wrote:
> Hi,
>
> Wasn't there just a thread on throttling a few weeks ago?
>
> I had a machine hit hard yesterday with a spider that ignored robots.txt.

I thought the standard practice these days was to put some URL at an
un-reachable place (by a human), for example via a link that is invisible
on the page. Then ban that URL via robots.txt, and automatically update
your routing tables for any IP addresses that try to visit it.

Just a thought, there's probably more to it.
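A sketch of that trap (all paths here are invented; adjust to taste). The
hidden URL goes in robots.txt and in a link no human will follow, and
anything that still fetches it is a robots.txt-ignoring spider:

```apache
# robots.txt:
#   User-agent: *
#   Disallow: /trap/
#
# Somewhere in a page template, a link humans won't follow:
#   <a href="/trap/"></a>
#
# httpd.conf: refuse the trap URL and tag the hit into a separate log,
# which a script can watch to update routing tables or firewall rules.
SetEnvIf Request_URI ^/trap/ trap_hit
CustomLog logs/trap_log common env=trap_hit
RewriteEngine On
RewriteRule ^/trap/ - [F]
```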

Matt.