Re: Preventing bots from starving other users?
Hi,

On Mon, Nov 16, 2009 at 04:33:34PM +0100, Wout Mertens wrote:
> Schweet! I'll give that a shot.

If you want to experiment a bit, with version 1.4 (development), you can
even add a delay to all the requests from this bot. The idea is to
identify the bot with an ACL and tell the TCP layer to wait for the full
evaluation time before forwarding the request.

For instance, let's say that the bot does not set any User-Agent. We then
consider that any request with a User-Agent is a valid request :

  frontend xxx
      ...
      acl valid_req hdr_cnt(user-agent) gt 0
      tcp-request inspect-delay 5s                  # the time to wait for those which match
      tcp-request content accept if HTTP valid_req  # valid request passes
      tcp-request content accept if HTTP WAIT_END   # other ones wait
      tcp-request content reject                    # non-HTTP is rejected

You can already do that with 1.3.22, but only based on layer 4
information (namely, the source IP address) :

      acl valid_src src 192.168.0.0/16
      tcp-request inspect-delay 5s                  # the time to wait for those which match
      tcp-request content accept if valid_src       # valid request passes
      tcp-request content accept if WAIT_END        # other ones wait

Or, if you know the bot :

      acl bot_src src 10.20.30.40
      tcp-request inspect-delay 5s                  # the time to wait for those which match
      tcp-request content accept if bot_src WAIT_END # bot waits
      tcp-request content accept                    # other ones pass

With 1.4, it is even possible to combine that with cookies. Imagine that
you add a small delay (e.g. 1 second) to the first request of every user,
then assign them a cookie and don't apply the delay after that.
If the bot does not learn the cookie (very likely), it will always suffer
the delay, for each request :

  frontend xxx
      acl seen hdr_sub(cookie) SEEN=1
      tcp-request inspect-delay 1s                  # the time to wait for new users
      tcp-request content accept if HTTP seen       # valid request passes
      tcp-request content accept if HTTP WAIT_END   # other ones wait
      tcp-request content reject                    # non-HTTP is rejected
      rspadd Set-Cookie: SEEN=1                     # do not harm real browsers

Good luck !
Willy
Re: Preventing bots from starving other users?
Schweet! I'll give that a shot.

Wout.

On Nov 16, 2009, at 4:08 PM, Karsten Elfenbein wrote:

> you can just create the backend in haproxy and use the same backend
> server definition - no need to reconfigure apache
>
> put like 7 max sessions for normal users on one backend and 2 for max
> sessions on the bot backend
> throw in some queues and you are set
>
> Karsten
>
> On Monday, 16 November 2009, you wrote:
>> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>>> Just create an additional backend and assign the bots to it.
>>> You can set queues and max connections there as needed.
>>
>> Yes, you're right - that's probably the best solution. I'll create an
>> extra apache process on the same server that will handle the bot
>> subnet. No extra hardware needed. Thanks!
>>
>> The wiki in question is TWiki - very flexible but very bad at caching
>> what it does. Basically, for each page view the complete interpreter
>> and all plugins get loaded.
>>
>> Wout.
Re: Preventing bots from starving other users?
you can just create the backend in haproxy and use the same backend
server definition - no need to reconfigure apache.

Put like 7 max sessions for normal users on one backend and 2 for max
sessions on the bot backend, throw in some queues and you are set.

Karsten

On Monday, 16 November 2009, you wrote:
> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>> Just create an additional backend and assign the bots to it.
>> You can set queues and max connections there as needed.
>
> Yes, you're right - that's probably the best solution. I'll create an
> extra apache process on the same server that will handle the bot
> subnet. No extra hardware needed. Thanks!
>
> The wiki in question is TWiki - very flexible but very bad at caching
> what it does. Basically, for each page view the complete interpreter
> and all plugins get loaded.
>
> Wout.

--
Kind regards

Karsten Elfenbein
Development and System Administration

erento - Der Online-Marktplatz für Mietartikel.

erento GmbH
Friedenstrasse 91
D-10249 Berlin

Tel: +49 (30) 2000 42064
Fax: +49 (30) 2000 8499
eMail: karsten.elfenb...@erento.com

- - - - - - - - - - - - - - - - - - - - - - - - - -
Hotline: 01805 - 373 686 (14 ct/min.)
Registered office of erento GmbH: Berlin
Managing directors: Chris Möller & Oliver Weyergraf
Commercial register Berlin Charlottenburg, HRB 101206B
- - - - - - - - - - - - - - - - - - - - - - - - - -
http://www.erento.com - alles online mieten.
Re: Preventing bots from starving other users?
Perhaps this plugin could be useful - never used it, though:
http://twiki.org/cgi-bin/view/Plugins.TWikiCacheAddOn

On Mon, Nov 16, 2009 at 11:46 AM, Wout Mertens wrote:
> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>
>> Just create an additional backend and assign the bots to it.
>> You can set queues and max connections there as needed.
>
> Yes, you're right - that's probably the best solution. I'll create an
> extra apache process on the same server that will handle the bot
> subnet. No extra hardware needed. Thanks!
>
> The wiki in question is TWiki - very flexible but very bad at caching
> what it does. Basically, for each page view the complete interpreter
> and all plugins get loaded.
>
> Wout.

--
Germán Gutiérrez
Infrastructure Team
OLX Inc.
Buenos Aires - Argentina
Phone: 54.11.4775.6696
Mobile: 54.911.5669.6175
Skype: errare_est
Email: germ...@olx.com

Delivering common sense since 1969.

The Nature is not amiable; It treats impartially to all the things.
The wise person is not amiable; He treats all people impartially.
Re: Preventing bots from starving other users?
On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:

> Just create an additional backend and assign the bots to it.
> You can set queues and max connections there as needed.

Yes, you're right - that's probably the best solution. I'll create an
extra apache process on the same server that will handle the bot subnet.
No extra hardware needed. Thanks!

The wiki in question is TWiki - very flexible but very bad at caching
what it does. Basically, for each page view the complete interpreter and
all plugins get loaded.

Wout.
RE: Preventing bots from starving other users?
Make sure you set KeepAlive to off in Apache. That keeps more than one
request from being queued at a time without multiple connections being
open. You can also have haproxy do this for you with "option httpclose",
even if KeepAlive is enabled in Apache.

You could then use --hitcount with iptables rules and limit the number
of connections/sec based on IP addresses...

> -----Original Message-----
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Monday, November 16, 2009 9:19 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
>
> On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
>
>> Oopps, my bad... It's actually tc and not iptables. Google "tc qdisc"
>> for some info.
>>
>> You could allow your local ips to go unrestricted, and throttle all
>> other IPs to 512kb/sec for example.
>
> Hmmm... The problem isn't the data rate, it's the work associated with
> incoming requests. As soon as a 500 byte request hits, the web server
> has to do a lot of work.
>
>> What software is the wiki running on? I assume it's not running under
>> apache or there would be some ways to tune apache. As others have
>> mentioned, telling the crawlers to behave themselves or totally
>> ignore the wiki with a robots file is probably best.
>
> Well the web server is Apache, but surprisingly Apache doesn't allow
> for tuning this particular case. Suppose normal request traffic looks
> like this (A are users):
>
> Time ->
>
> A A AA AA AAA AAA A
>
> With the bot this becomes
>
> ABB A A BBA BA AABB
>
> So you can see that normal users are just swamped out of "slots". The
> webserver can render about 9 pages at the same time without impact,
> but it takes a second or more to render. At first I set MaxClients to
> 9, which makes it so the web server doesn't swap to death, but if the
> bots have 8 requests queued up, and then another 8, and another 8,
> regular users have no chance of decent interactivity...
>
> This may be a corner case due to slow serving, because I'm having a
> hard time finding a way to throttle the bots. I suppose that normally
> you'd just add servers...
>
> Wout.
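The --hitcount idea above can be sketched with the iptables "recent"
match (illustrative only; the port and thresholds are assumptions, and
note that this drops excess connections rather than queueing them):

```shell
# Remember every new connection to port 80 per source address.
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
    -m recent --name HTTP --set

# Drop sources that opened 20 or more new connections in the last 60s.
iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
    -m recent --name HTTP --update --seconds 60 --hitcount 20 -j DROP
```

The default kernel list size caps --hitcount at 20, so larger thresholds
need the xt_recent/ipt_recent module parameters adjusted.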
RE: Preventing bots from starving other users?
You can ask (polite) bots to throttle their request rates and
simultaneous requests. I think that you'd probably be quite interested
in the Crawl-delay directive:
http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

This is respected by at least MSN and Yahoo. Unfortunately, it looks
like Google may not respect it; they propose this alternative:
http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Of course, if you're being scraped by a bot that doesn't respect this
directive, or by a more malicious scraper, it won't help you at all.

-JohnF

> -----Original Message-----
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: November 16, 2009 9:19 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
>
> On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
>
>> Oopps, my bad... It's actually tc and not iptables. Google "tc qdisc"
>> for some info.
>>
>> You could allow your local ips to go unrestricted, and throttle all
>> other IPs to 512kb/sec for example.
>
> Hmmm... The problem isn't the data rate, it's the work associated with
> incoming requests. As soon as a 500 byte request hits, the web server
> has to do a lot of work.
>
>> What software is the wiki running on? I assume it's not running under
>> apache or there would be some ways to tune apache. As others have
>> mentioned, telling the crawlers to behave themselves or totally
>> ignore the wiki with a robots file is probably best.
>
> Well the web server is Apache, but surprisingly Apache doesn't allow
> for tuning this particular case. Suppose normal request traffic looks
> like this (A are users):
>
> Time ->
>
> A A AA AA AAA AAA A
>
> With the bot this becomes
>
> ABB A A BBA BA AABB
>
> So you can see that normal users are just swamped out of "slots". The
> webserver can render about 9 pages at the same time without impact,
> but it takes a second or more to render. At first I set MaxClients to
> 9, which makes it so the web server doesn't swap to death, but if the
> bots have 8 requests queued up, and then another 8, and another 8,
> regular users have no chance of decent interactivity...
>
> This may be a corner case due to slow serving, because I'm having a
> hard time finding a way to throttle the bots. I suppose that normally
> you'd just add servers...
>
> Wout.
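The Crawl-delay approach mentioned above would go in robots.txt, roughly
like this (a sketch; the delay value and the path are illustrative, not
taken from the thread):

```text
User-agent: *
# Ask polite crawlers to wait between requests. Honored by Yahoo and
# MSN; Google ignores it and uses its Webmaster Tools setting instead.
Crawl-delay: 10
# Keep crawlers away from the expensive script URLs entirely.
Disallow: /twiki/bin/
```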
Re: Preventing bots from starving other users?
On Nov 16, 2009, at 2:43 PM, John Lauro wrote:

> Oopps, my bad... It's actually tc and not iptables. Google "tc qdisc"
> for some info.
>
> You could allow your local ips to go unrestricted, and throttle all
> other IPs to 512kb/sec for example.

Hmmm... The problem isn't the data rate, it's the work associated with
incoming requests. As soon as a 500 byte request hits, the web server
has to do a lot of work.

> What software is the wiki running on? I assume it's not running under
> apache or there would be some ways to tune apache. As others have
> mentioned, telling the crawlers to behave themselves or totally ignore
> the wiki with a robots file is probably best.

Well the web server is Apache, but surprisingly Apache doesn't allow for
tuning this particular case. Suppose normal request traffic looks like
this (A are users):

Time ->

A A AA AA AAA AAA A

With the bot this becomes

ABB A A BBA BA AABB

So you can see that normal users are just swamped out of "slots". The
webserver can render about 9 pages at the same time without impact, but
it takes a second or more to render. At first I set MaxClients to 9,
which makes it so the web server doesn't swap to death, but if the bots
have 8 requests queued up, and then another 8, and another 8, regular
users have no chance of decent interactivity...

This may be a corner case due to slow serving, because I'm having a hard
time finding a way to throttle the bots. I suppose that normally you'd
just add servers...

Wout.
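The MaxClients cap described above would sit in Apache's prefork
configuration, roughly like this (a sketch; directive placement and
module names depend on the Apache version, and only the value 9 comes
from the thread):

```text
# Apache 2.x prefork MPM: cap concurrent workers so the box never swaps.
<IfModule mpm_prefork_module>
    StartServers       2
    MaxSpareServers    4
    MaxClients         9   # roughly as many pages as the box renders without impact
</IfModule>

# Don't let idle bot connections hold a worker hostage.
KeepAlive Off
```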
RE: Preventing bots from starving other users?
Oopps, my bad... It's actually tc and not iptables. Google "tc qdisc"
for some info.

You could allow your local ips to go unrestricted, and throttle all
other IPs to 512kb/sec for example.

What software is the wiki running on? I assume it's not running under
apache or there would be some ways to tune apache. As others have
mentioned, telling the crawlers to behave themselves or totally ignore
the wiki with a robots file is probably best.

> -----Original Message-----
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Monday, November 16, 2009 7:31 AM
> To: John Lauro
> Cc: haproxy@formilux.org
> Subject: Re: Preventing bots from starving other users?
>
> Hi John,
>
> On Nov 15, 2009, at 8:29 PM, John Lauro wrote:
>
>> I would probably do that sort of throttling at the OS level with
>> iptables, etc...
>
> Hmmm... How? I don't want to throw away the requests, just queue them.
> Looking at iptables rate limiting, it seems that you can only drop the
> request.
>
> Then again:
>
>> That said, before that I would investigate why the wiki is so slow...
>> Something probably isn't configured right if it chokes with only a
>> few simultaneous accesses. I mean, unless it's an embedded server
>> with under 32MB of RAM, the hardware should be able to handle that...
>
> Yeah, it's running pretty old software on a pretty old server. It
> should be upgraded but that is a fair bit of work; I was hoping that a
> bit of configuration could make the situation fair again...
>
> Thanks,
>
> Wout.
Re: Preventing bots from starving other users?
If the bot conforms, why not just control its behavior by specifying
restrictions in your robots.txt? http://www.robotstxt.org/

On Sun, Nov 15, 2009 at 9:57 AM, Wout Mertens wrote:
> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
>
> Is this possible?
>
> Thanks,
>
> Wout.
Re: Preventing bots from starving other users?
Just create an additional backend and assign the bots to it. You can set
queues and max connections there as needed.

An additional tip might be to adjust the robots.txt file, as some bots
can be slowed down:
http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Check if the bots that are crawling have some real use for you;
otherwise just adjust your robots.txt or block them.

Some basic stuff for mysql + mediawiki might be to check whether the
mysql query cache is working.

Karsten

On Sunday, 15 November 2009, you wrote:
> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
>
> Is this possible?
>
> Thanks,
>
> Wout.
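Karsten's backend split could look roughly like this (a sketch only; the
names, addresses, and the User-Agent match are assumptions, and both
backends can point at the very same Apache, so nothing needs
reconfiguring there):

```text
frontend wiki
    bind :80
    # Crude bot detection; adjust to the bots actually seen in the logs.
    acl is_bot hdr_sub(user-agent) -i bot crawler spider
    use_backend bots if is_bot
    default_backend users

backend users
    # Normal users share 7 of the ~9 slots the server can render.
    server apache1 127.0.0.1:8080 maxconn 7

backend bots
    # Bots get 2 slots; requests beyond that wait in haproxy's queue.
    server apache1 127.0.0.1:8080 maxconn 2
```

With maxconn set per server, excess requests queue inside haproxy
instead of piling up in Apache, which is exactly the fairness the
original question asked for.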
Re: Preventing bots from starving other users?
Hi John,

On Nov 15, 2009, at 8:29 PM, John Lauro wrote:

> I would probably do that sort of throttling at the OS level with
> iptables, etc...

Hmmm... How? I don't want to throw away the requests, just queue them.
Looking at iptables rate limiting, it seems that you can only drop the
request.

Then again:

> That said, before that I would investigate why the wiki is so slow...
> Something probably isn't configured right if it chokes with only a few
> simultaneous accesses. I mean, unless it's an embedded server with
> under 32MB of RAM, the hardware should be able to handle that...

Yeah, it's running pretty old software on a pretty old server. It should
be upgraded but that is a fair bit of work; I was hoping that a bit of
configuration could make the situation fair again...

Thanks,

Wout.
Re: Preventing bots from starving other users?
2009/11/15 Wout Mertens :
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
>
> Is this possible?

Guess so. I move traffic from crawlers to a special web backend because
they mostly harvest during my backup window and slow everything down
even more. Adding a request limit should also be easy; just check the
documentation.

--
Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)
Ul. Kruczkowskiego 12, 80-288 Gdańsk
Company registered in the KRS under no. 246596 by decision of the
District Court Gdańsk-Północ
Re: Preventing bots from starving other users?
On Sun 15.11.2009 15:57, Wout Mertens wrote:

> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
>
> Is this possible?

Maybe with the "src" and "fe_sess_rate" ACL criteria described in
http://haproxy.1wt.eu/download/1.3/doc/configuration.txt

Maybe you also get some ideas from section "5) Access lists" in
http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt

Hth
Aleks
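The fe_sess_rate idea could be combined with content inspection along
these lines (a sketch for 1.3.x; the frontend name, the local network,
and the 10 sessions/s threshold are illustrative, not from the thread):

```text
frontend wiki
    bind :80
    acl overload  fe_sess_rate ge 10     # frontend busier than 10 sessions/s
    tcp-request inspect-delay 3s
    # Under normal load everyone passes immediately; when overloaded,
    # new connections sit out the inspect delay before being forwarded.
    tcp-request content accept if ! overload
    tcp-request content accept if WAIT_END
```

This only slows admission globally, not per source IP, but it keeps the
backlog in haproxy rather than in Apache's worker pool.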
RE: Preventing bots from starving other users?
I would probably do that sort of throttling at the OS level with
iptables, etc...

That said, before that I would investigate why the wiki is so slow...
Something probably isn't configured right if it chokes with only a few
simultaneous accesses. I mean, unless it's an embedded server with under
32MB of RAM, the hardware should be able to handle that...

> -----Original Message-----
> From: Wout Mertens [mailto:wout.mert...@gmail.com]
> Sent: Sunday, November 15, 2009 9:57 AM
> To: haproxy@formilux.org
> Subject: Preventing bots from starving other users?
>
> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent
>   connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers
>   time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could
> solve this by having a per-source-IP queue, which made sure that a
> given source IP cannot have more than e.g. 3 requests active at the
> same time. Requests beyond that would get queued.
>
> Is this possible?
>
> Thanks,
>
> Wout.
Preventing bots from starving other users?
Hi there,

I was wondering if HAProxy helps in the following situation:

- We have a wiki site which is quite slow
- Regular users don't have many problems
- We also get crawled by a search bot, which creates many concurrent
  connections, more than the hardware can handle
- Therefore, service is degraded and users usually have their browsers
  time out on them

Given that we can't make the wiki faster, I was thinking that we could
solve this by having a per-source-IP queue, which made sure that a given
source IP cannot have more than e.g. 3 requests active at the same time.
Requests beyond that would get queued.

Is this possible?

Thanks,

Wout.