Re: Preventing bots from starving other users?

2009-11-16 Thread Willy Tarreau
Hi,

On Mon, Nov 16, 2009 at 04:33:34PM +0100, Wout Mertens wrote:
> Schweet! I'll give that a shot.

If you want to experiment a bit, with version 1.4 (development),
you can even add a delay to all the requests from this bot. The
idea is to identify the bot with an ACL and tell the TCP layer
to wait for the full evaluation time before forwarding the request:

For instance, let's say that the bot does not set any User-Agent.
We then consider that any request with a User-Agent is a valid
request:

frontend xxx
   ...
   acl valid_req hdr_cnt(user-agent) gt 0
   tcp-request inspect-delay 5s                  # the time to wait for those which match
   tcp-request content accept if HTTP valid_req  # valid request passes
   tcp-request content accept if HTTP WAIT_END   # other ones wait
   tcp-request content reject                    # non-HTTP is rejected

You can already do that with 1.3.22, but only based on layer 4
information (namely, the source IP address):

   acl valid_src src 192.168.0.0/16
   tcp-request inspect-delay 5s              # the time to wait for those which match
   tcp-request content accept if valid_src   # valid request passes
   tcp-request content accept if WAIT_END    # other ones wait

Or if you know the bot:

   acl bot_src src 10.20.30.40
   tcp-request inspect-delay 5s                     # the time to wait for those which match
   tcp-request content accept if bot_src WAIT_END   # the bot waits
   tcp-request content accept                       # other ones pass

With 1.4, it is even possible to combine that with cookies.
Imagine that you add a small delay (e.g. 1 second) to the
first request of every user, then assign them a cookie and
don't apply the delay after that. If the bot does not learn
the cookie (very likely), it will always suffer the delay,
on every request:

frontend xxx
   acl seen hdr_sub(cookie) SEEN=1
   tcp-request inspect-delay 1s                 # the time to wait for new users
   tcp-request content accept if HTTP seen      # valid request passes
   tcp-request content accept if HTTP WAIT_END  # other ones wait
   tcp-request content reject                   # non-HTTP is rejected
   rspadd Set-Cookie:\ SEEN=1                   # does not harm real browsers

Good luck !

Willy




Re: Preventing bots from starving other users?

2009-11-16 Thread Wout Mertens
Schweet! I'll give that a shot.

Wout.

On Nov 16, 2009, at 4:08 PM, Karsten Elfenbein wrote:

> You can just create the backend in haproxy and reuse the same backend server
> definition; no need to reconfigure Apache.
> 
> Put, say, 7 max sessions for normal users on one backend and 2 max sessions
> on the bot backend, throw in some queues and you are set.
> 
> Karsten
> 
> On Monday, 16 November 2009, you wrote:
>> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>>> Just create an additional backend and assign the bots to it.
>>> You can set queues and max connections there as needed.
>> 
>> Yes, you're right - that's probably the best solution. I'll create an extra
>> apache process on the same server that will handle the bot subnet. No
>> extra hardware needed. Thanks!
>> 
>> The wiki in question is TWiki - very flexible but very bad at caching what
>> it does. Basically, for each page view the complete interpreter and all
>> plugins get loaded.
>> 
>> Wout.
>> 
> 




Re: Preventing bots from starving other users?

2009-11-16 Thread Karsten Elfenbein
You can just create the backend in haproxy and reuse the same backend server
definition; no need to reconfigure Apache.

Put, say, 7 max sessions for normal users on one backend and 2 max sessions
on the bot backend, throw in some queues and you are set.
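
Roughly, as a sketch (the addresses, the bot ACL and the timeouts below are
just placeholders to illustrate the idea, not a tested config):

defaults
   mode http
   timeout connect 5s
   timeout client  30s
   timeout server  300s
   timeout queue   60s                       # how long a request may wait for a free slot

frontend wiki
   bind 0.0.0.0:80
   acl is_bot src 10.20.30.40                # placeholder: the crawler's address
   use_backend bots if is_bot
   default_backend users

backend users
   server apache1 127.0.0.1:8080 maxconn 7   # normal users get most of the slots

backend bots
   server apache1 127.0.0.1:8080 maxconn 2   # same Apache instance, only 2 slots for bots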

Karsten

On Monday, 16 November 2009, you wrote:
> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
> > Just create an additional backend and assign the bots to it.
> > You can set queues and max connections there as needed.
> 
> Yes, you're right - that's probably the best solution. I'll create an extra
>  apache process on the same server that will handle the bot subnet. No
>  extra hardware needed. Thanks!
> 
> The wiki in question is TWiki - very flexible but very bad at caching what
>  it does. Basically, for each page view the complete interpreter and all
>  plugins get loaded.
> 
> Wout.
> 


-- 
Best regards

Karsten Elfenbein
Development and System Administration

erento - the online marketplace for rental items.

erento GmbH
Friedenstrasse 91
D-10249 Berlin

Tel: +49 (30) 2000 42064
Fax: +49 (30) 2000 8499
eMail: karsten.elfenb...@erento.com

- - - - - - - - - - - - - - - - - - - - - - - - - -
Hotline: 01805 - 373 686 (14 ct/min.)
Registered office of erento GmbH: Berlin
Managing directors: Chris Möller & Oliver Weyergraf
Commercial register: Berlin Charlottenburg, HRB 101206B
- - - - - - - - - - - - - - - - - - - - - - - - - -
http://www.erento.com - rent everything online.



Re: Preventing bots from starving other users?

2009-11-16 Thread German Gutierrez
Perhaps this plugin could be useful; I've never used it, though:

http://twiki.org/cgi-bin/view/Plugins.TWikiCacheAddOn

On Mon, Nov 16, 2009 at 11:46 AM, Wout Mertens wrote:

> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>
> > Just create an additional backend and assign the bots to it.
> > You can set queues and max connections there as needed.
>
> Yes, you're right - that's probably the best solution. I'll create an extra
> apache process on the same server that will handle the bot subnet. No extra
> hardware needed. Thanks!
>
> The wiki in question is TWiki - very flexible but very bad at caching what
> it does. Basically, for each page view the complete interpreter and all
> plugins get loaded.
>
> Wout.
>



-- 
Germán Gutiérrez

Infrastructure Team
OLX Inc.
Buenos Aires - Argentina
Phone: 54.11.4775.6696
Mobile: 54.911.5669.6175
Skype: errare_est
Email: germ...@olx.com

Delivering common sense since 1969.

Nature is not amiable; it treats all things impartially. The wise
person is not amiable; he treats all people impartially.


Re: Preventing bots from starving other users?

2009-11-16 Thread Wout Mertens
On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:

> Just create an additional backend and assign the bots to it.
> You can set queues and max connections there as needed.

Yes, you're right - that's probably the best solution. I'll create an extra 
apache process on the same server that will handle the bot subnet. No extra 
hardware needed. Thanks!

The wiki in question is TWiki - very flexible but very bad at caching what it 
does. Basically, for each page view the complete interpreter and all plugins 
get loaded.

Wout.


RE: Preventing bots from starving other users?

2009-11-16 Thread John Lauro
Make sure you set KeepAlive to off in Apache.  That keeps a client from
queueing more than one request at a time without opening multiple connections.
You can also have haproxy do this for you with option httpclose, even if
KeepAlive is enabled in Apache.

You could then use --hitcount with iptables rules and limit the number of
connections per second per IP address...
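
For example, a minimal sketch of the haproxy side (the address and the maxconn
value are placeholders):

listen wiki 0.0.0.0:80
   mode http
   option httpclose                          # add "Connection: close" so connections are not kept alive
   server apache1 127.0.0.1:8080 maxconn 9   # placeholder; 9 matches the MaxClients discussed below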


> -Original Message-
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Monday, November 16, 2009 9:19 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
> 
> On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
> 
> > Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
> > for some info.
> >
> > You could allow your local IPs to go unrestricted, and throttle all
> > other IPs to 512 kb/sec, for example.
> 
> Hmmm... The problem isn't the data rate, it's the work associated with
> incoming requests. As soon as a 500 byte request hits, the web server
> has to do a lot of work.
> 
> > What software is the wiki running on?  I assume it's not running under
> > Apache, or there would be some ways to tune Apache.  As others have
> > mentioned, telling the crawlers to behave themselves or to ignore the
> > wiki entirely with a robots file is probably best.
> 
> Well the web server is Apache, but surprisingly Apache doesn't allow
> for tuning this particular case. Suppose normal request traffic looks
> like (A are users)
> 
> Time ->
> 
> A  A   AA  AA   AAA  AAA A
> 
> With the bot this becomes
> 
> ABB A A BBA BA AABB
> 
> So you can see that normal users are just swamped out of "slots". The
> webserver can render about 9 pages at the same time without impact, but
> it takes a second or more to render. At first I set MaxClients to 9,
> which makes it so the web server doesn't swap to death, but if the bots
> have 8 requests queued up, and then another 8, and another 8, regular
> users have no chance of decent interactivity...
> 
> This may be a corner case due to slow serving, because I'm having a
> hard time finding a way to throttle the bots. I suppose that normally
> you'd just add servers...
> 
> Wout.
> 




RE: Preventing bots from starving other users?

2009-11-16 Thread John Marrett
You can ask (polite) bots to throttle their request rates and
simultaneous requests. I think you'd probably be quite interested
in the crawl-delay directive:

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

This is respected by at least MSN and Yahoo. Unfortunately, it looks
like Google may or may not respect it; they propose this alternative:

http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Of course, if you're being scraped by a bot that doesn't respect this
directive, or by a more malicious scraper, it won't help you at all.
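
For example, a robots.txt along these lines asks compliant crawlers to wait 10
seconds between requests (the Disallow paths are only illustrative TWiki-style
examples; Google ignores Crawl-delay, so use the setting above for it):

User-agent: *
Crawl-delay: 10
# example only: keep crawlers out of expensive edit/search URLs
Disallow: /twiki/bin/edit/
Disallow: /twiki/bin/search/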

-JohnF

 

> -Original Message-
> From: Wout Mertens [mailto:wout.mert...@gmail.com] 
> Sent: November 16, 2009 9:19 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
> 
> On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
> 
> > Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
> > for some info.
> > 
> > You could allow your local IPs to go unrestricted, and throttle all
> > other IPs to 512 kb/sec, for example.
> 
> Hmmm... The problem isn't the data rate, it's the work 
> associated with incoming requests. As soon as a 500 byte 
> request hits, the web server has to do a lot of work. 
> 
> > What software is the wiki running on?  I assume it's not running under
> > Apache, or there would be some ways to tune Apache.  As others have
> > mentioned, telling the crawlers to behave themselves or to ignore the
> > wiki entirely with a robots file is probably best.
> 
> Well the web server is Apache, but surprisingly Apache 
> doesn't allow for tuning this particular case. Suppose normal 
> request traffic looks like (A are users)
> 
> Time ->
> 
> A  A   AA  AA   AAA  AAA A
> 
> With the bot this becomes
> 
> ABB A A BBA BA AABB
> 
> So you can see that normal users are just swamped out of 
> "slots". The webserver can render about 9 pages at the same 
> time without impact, but it takes a second or more to render. 
> At first I set MaxClients to 9, which makes it so the web 
> server doesn't swap to death, but if the bots have 8 requests 
> queued up, and then another 8, and another 8, regular users 
> have no chance of decent interactivity...
> 
> This may be a corner case due to slow serving, because I'm 
> having a hard time finding a way to throttle the bots. I 
> suppose that normally you'd just add servers...
> 
> Wout.
> 



Re: Preventing bots from starving other users?

2009-11-16 Thread Wout Mertens
On Nov 16, 2009, at 2:43 PM, John Lauro wrote:

> Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
> for some info.
> 
> You could allow your local IPs to go unrestricted, and throttle all other IPs
> to 512 kb/sec, for example.

Hmmm... The problem isn't the data rate, it's the work associated with incoming 
requests. As soon as a 500 byte request hits, the web server has to do a lot of 
work. 

> What software is the wiki running on?  I assume it's not running under Apache,
> or there would be some ways to tune Apache.  As others have mentioned, telling
> the crawlers to behave themselves or to ignore the wiki entirely with a robots
> file is probably best.

Well, the web server is Apache, but surprisingly Apache doesn't allow for tuning
this particular case. Suppose normal request traffic looks like this (each A is a
user request):

Time ->

A  A   AA  AA   AAA  AAA A

With the bot (B) this becomes:

ABB A A BBA BA AABB

So you can see that normal users are just swamped out of "slots". The webserver
can render about 9 pages at the same time without impact, but each page takes a
second or more to render. At first I set MaxClients to 9, which keeps the web
server from swapping itself to death, but if the bots have 8 requests queued up,
and then another 8, and another 8, regular users have no chance of decent
interactivity...

This may be a corner case due to slow serving, because I'm having a hard time 
finding a way to throttle the bots. I suppose that normally you'd just add 
servers...

Wout.


RE: Preventing bots from starving other users?

2009-11-16 Thread John Lauro
Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
for some info.

You could allow your local IPs to go unrestricted, and throttle all other IPs
to 512 kb/sec, for example.

What software is the wiki running on?  I assume it's not running under Apache,
or there would be some ways to tune Apache.  As others have mentioned, telling
the crawlers to behave themselves or to ignore the wiki entirely with a robots
file is probably best.

> -Original Message-
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Monday, November 16, 2009 7:31 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
> 
> Hi John,
> 
> On Nov 15, 2009, at 8:29 PM, John Lauro wrote:
> 
> > I would probably do that sort of throttling at the OS level with
> iptables,
> > etc...
> 
> Hmmm... How? I don't want to throw away the requests, just queue them.
> Looking at iptables rate limiting, it seems that you can only drop the
> request.
> 
> Then again:
> 
> > That said, before that I would investigate why the wiki is so slow...
> > Something probably isn't configured right if it chokes with only a
> few
> > simultaneous accesses.  I mean, unless it's embedded server with
> under 32MB
> > of RAM, the hardware should be able to handle that...
> 
> Yeah, it's running pretty old software on a pretty old server. It
> should be upgraded but that is a fair bit of work; I was hoping that a
> bit of configuration could make the situation fair again...
> 
> Thanks,
> 
> Wout.
> 




Re: Preventing bots from starving other users?

2009-11-16 Thread Brent Walker
If the bot conforms, why not just control its behavior by specifying
restrictions in your robots.txt?

http://www.robotstxt.org/

On Sun, Nov 15, 2009 at 9:57 AM, Wout Mertens  wrote:
> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent 
> connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time 
> out on them
>
> Given that we can't make the wiki faster, I was thinking that we could solve 
> this by having a per-source-IP queue, which made sure that a given source IP 
> cannot have more than e.g. 3 requests active at the same time. Requests 
> beyond that would get queued.
>
> Is this possible?
>
> Thanks,
>
> Wout.
>



Re: Preventing bots from starving other users?

2009-11-16 Thread Karsten Elfenbein
Just create an additional backend and assign the bots to it.
You can set queues and max connections there as needed.
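
As a sketch, one way to do the assignment is to match on the crawler's
User-Agent (the patterns, names and address here are made up; match on source
addresses instead if you know them):

frontend wiki
   bind 0.0.0.0:80
   mode http
   acl is_bot hdr_sub(User-Agent) -i bot crawler spider   # made-up patterns
   use_backend bot_backend if is_bot
   default_backend normal_backend

backend bot_backend
   mode http
   server apache1 127.0.0.1:8080 maxconn 2   # few slots; further bot requests queue in haproxy

backend normal_backend
   mode http
   server apache1 127.0.0.1:8080 maxconn 7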

Also, an additional tip: adjust the robots.txt file, as some bots can be slowed
down that way.
http://www.google.com/support/webmasters/bin/answer.py?answer=48620
Check whether the bots that are crawling have some real use for you; otherwise
just adjust your robots.txt or block them.

For a basic MySQL + MediaWiki setup, another thing to check is whether the MySQL
query cache is working.

Karsten

On Sunday, 15 November 2009, you wrote:
> Hi there,
> 
> I was wondering if HAProxy helps in the following situation:
> 
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time
>   out on them
> 
> Given that we can't make the wiki faster, I was thinking that we could
>  solve this by having a per-source-IP queue, which made sure that a given
>  source IP cannot have more than e.g. 3 requests active at the same time.
>  Requests beyond that would get queued.
> 
> Is this possible?
> 
> Thanks,
> 
> Wout.
> 


-- 
Best regards

Karsten Elfenbein
Development and System Administration

erento - the online marketplace for rental items.

erento GmbH
Friedenstrasse 91
D-10249 Berlin

Tel: +49 (30) 2000 42064
Fax: +49 (30) 2000 8499
eMail: karsten.elfenb...@erento.com

- - - - - - - - - - - - - - - - - - - - - - - - - -
Hotline: 01805 - 373 686 (14 ct/min.)
Registered office of erento GmbH: Berlin
Managing directors: Chris Möller & Oliver Weyergraf
Commercial register: Berlin Charlottenburg, HRB 101206B
- - - - - - - - - - - - - - - - - - - - - - - - - -
http://www.erento.com - rent everything online.



Re: Preventing bots from starving other users?

2009-11-16 Thread Wout Mertens
Hi John,

On Nov 15, 2009, at 8:29 PM, John Lauro wrote:

> I would probably do that sort of throttling at the OS level with iptables,
> etc...

Hmmm... How? I don't want to throw away the requests, just queue them. Looking
at iptables rate limiting, it seems that you can only drop the request.

Then again:

> That said, before that I would investigate why the wiki is so slow...
> Something probably isn't configured right if it chokes with only a few
> simultaneous accesses.  I mean, unless it's an embedded server with under 32MB
> of RAM, the hardware should be able to handle that...

Yeah, it's running pretty old software on a pretty old server. It should be 
upgraded but that is a fair bit of work; I was hoping that a bit of 
configuration could make the situation fair again...

Thanks,

Wout.


Re: Preventing bots from starving other users?

2009-11-15 Thread Łukasz Jagiełło
2009/11/15 Wout Mertens :
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent 
> connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time 
> out on them
>
> Given that we can't make the wiki faster, I was thinking that we could solve 
> this by having a per-source-IP queue, which made sure that a given source IP 
> cannot have more than e.g. 3 requests active at the same time. Requests 
> beyond that would get queued.
>
> Is this possible?

Guess so. I move traffic from crawlers to a special web backend because they
mostly harvest during my backup window and slow everything down even more.
Adding a request limit should also be easy; just check the documentation.

-- 
Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)

Ul. Kruczkowskiego 12, 80-288 Gdańsk
Company registered in the KRS under no. 246596 by decision of the District Court Gdańsk-Północ



Re: Preventing bots from starving other users?

2009-11-15 Thread Aleksandar Lazic

On Sun 15.11.2009 15:57, Wout Mertens wrote:

> Hi there,
> 
> I was wondering if HAProxy helps in the following situation:
> 
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
> 
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
> 
> Is this possible?


Maybe with http://haproxy.1wt.eu/download/1.3/doc/configuration.txt,
using the "src" and "fe_sess_rate" criteria in the acl section.
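
For example, something along these lines (untested sketch; the rate, the
trusted range and the addresses are placeholders) delays new requests once the
frontend goes above 10 sessions per second, while trusted sources pass straight
through:

frontend wiki
   bind 0.0.0.0:80
   mode http
   acl local_users src 192.168.0.0/16        # placeholder: trusted range
   acl too_fast fe_sess_rate ge 10           # frontend above 10 sessions per second
   tcp-request inspect-delay 1s
   tcp-request content accept if local_users
   tcp-request content accept if !too_fast
   tcp-request content accept if WAIT_END    # the rest wait out the 1s delay
   default_backend apache

backend apache
   mode http
   server apache1 127.0.0.1:8080 maxconn 9   # placeholder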

Maybe you can also get some ideas from
http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt,
section "5) Access lists".

Hth

Aleks



RE: Preventing bots from starving other users?

2009-11-15 Thread John Lauro
I would probably do that sort of throttling at the OS level with iptables,
etc...

That said, before that I would investigate why the wiki is so slow...
Something probably isn't configured right if it chokes with only a few
simultaneous accesses.  I mean, unless it's an embedded server with under 32MB
of RAM, the hardware should be able to handle that...


> -Original Message-
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Sunday, November 15, 2009 9:57 AM
> To: haproxy@formilux.org
> Subject: Preventing bots from starving other users?
> 
> Hi there,
> 
> I was wondering if HAProxy helps in the following situation:
> 
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
> connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
> time out on them
> 
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
> 
> 
> Is this possible?
> 
> Thanks,
> 
> Wout.
> 




Preventing bots from starving other users?

2009-11-15 Thread Wout Mertens
Hi there,

I was wondering if HAProxy helps in the following situation:

- We have a wiki site which is quite slow
- Regular users don't have many problems
- We also get crawled by a search bot, which creates many concurrent 
connections, more than the hardware can handle
- Therefore, service is degraded and users usually have their browsers time out 
on them

Given that we can't make the wiki faster, I was thinking that we could solve 
this by having a per-source-IP queue, which made sure that a given source IP 
cannot have more than e.g. 3 requests active at the same time. Requests beyond 
that would get queued.

Is this possible?

Thanks,

Wout.