Re: Preventing bots from starving other users?
Hi John,

On Nov 15, 2009, at 8:29 PM, John Lauro wrote:

> I would probably do that sort of throttling at the OS level with iptables, etc...

Hmmm... how? I don't want to throw away the requests, just queue them. Looking into iptables rate limiting, it seems that you can only drop requests.

Then again:

> That said, before that I would investigate why the wiki is so slow... Something probably isn't configured right if it chokes with only a few simultaneous accesses. I mean, unless it's an embedded server with under 32MB of RAM, the hardware should be able to handle that...

Yeah, it's running pretty old software on a pretty old server. It should be upgraded, but that is a fair bit of work; I was hoping that a bit of configuration could make the situation fair again...

Thanks,

Wout.
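[Editor's note: as a small illustration of the limitation Wout describes, iptables can cap concurrent connections per source IP with the connlimit match, but anything over the limit is dropped or rejected rather than queued. The port and the threshold of 3 are just example values:]

    # refuse new HTTP connections from a source IP that already has 3 open;
    # the excess is rejected outright, not queued for later
    iptables -A INPUT -p tcp --syn --dport 80 \
        -m connlimit --connlimit-above 3 -j REJECT --reject-with tcp-reset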
Re: Preventing bots from starving other users?
Just create an additional backend and assign the bots to it. You can set queues and max connections there as needed.

An additional tip: adjust the robots.txt file, as some bots can be slowed down that way:
http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Check whether the bots that are crawling are of any real use to you; otherwise just adjust your robots.txt or block them. For a basic MySQL + MediaWiki setup, it might also be worth checking that the MySQL query cache is working.

Karsten

On Sunday, 15 November 2009, you wrote:

> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could solve this by having a per-source-IP queue, which made sure that a given source IP cannot have more than e.g. 3 requests active at the same time. Requests beyond that would get queued. Is this possible?
>
> Thanks,
>
> Wout.

--
Best regards

Karsten Elfenbein
Development and Systems Administration
erento GmbH, Berlin
http://www.erento.com
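[Editor's note: a minimal haproxy.cfg sketch of the backend split Karsten describes. The is_bot pattern list, backend names, the 127.0.0.1:8080 address and the maxconn values are all made-up examples; the point is that crawler requests beyond the server's maxconn wait in HAProxy's queue instead of piling up on the wiki:]

    defaults
        mode http
        timeout connect 5s
        timeout client  60s
        timeout server  60s
        timeout queue   60s

    frontend wiki
        bind :80
        # crude crawler detection on the User-Agent header; extend the list as needed
        acl is_bot hdr_sub(User-Agent) -i bot crawl spider slurp
        use_backend bots if is_bot
        default_backend users

    backend users
        # regular visitors get most of the rendering slots
        server wiki 127.0.0.1:8080 maxconn 8

    backend bots
        # crawlers share a single slot; excess requests queue in HAProxy
        server wiki 127.0.0.1:8080 maxconn 1

With maxconn 1 on the bot backend, a crawler opening dozens of parallel connections only ever occupies one wiki slot; the rest sit in HAProxy's queue until that slot frees up or the queue timeout expires.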
RE: Preventing bots from starving other users?
Oops, my bad... it's actually tc and not iptables. Google "tc qdisc" for some info. You could let your local IPs go unrestricted and throttle all other IPs to 512 kB/sec, for example.

What software is the wiki running on? I assume it's not running under Apache, or there would be some ways to tune Apache. As others have mentioned, telling the crawlers to behave themselves, or to ignore the wiki entirely via a robots file, is probably best.

-----Original Message-----
From: Wout Mertens [mailto:wout.mert...@gmail.com]
Sent: Monday, November 16, 2009 7:31 AM
To: John Lauro
Cc: haproxy@formilux.org
Subject: Re: Preventing bots from starving other users?

Hi John,

On Nov 15, 2009, at 8:29 PM, John Lauro wrote:

> I would probably do that sort of throttling at the OS level with iptables, etc...

Hmmm... how? I don't want to throw away the requests, just queue them. Looking into iptables rate limiting, it seems that you can only drop requests.

Then again:

> That said, before that I would investigate why the wiki is so slow... Something probably isn't configured right if it chokes with only a few simultaneous accesses. I mean, unless it's an embedded server with under 32MB of RAM, the hardware should be able to handle that...

Yeah, it's running pretty old software on a pretty old server. It should be upgraded, but that is a fair bit of work; I was hoping that a bit of configuration could make the situation fair again...

Thanks,

Wout.
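[Editor's note: a rough sketch of the kind of tc setup John may have in mind. The interface name, local subnet and rates are assumptions, and note that this shapes the bandwidth of responses leaving the box; it does not limit how many requests a crawler can have in flight, which is the objection Wout raises in the next message:]

    # root HTB qdisc on the interface facing the clients; anything not
    # matched by a filter falls into the throttled class 1:20
    tc qdisc add dev eth0 root handle 1: htb default 20

    # class 1:10: effectively unrestricted, for local users
    tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

    # class 1:20: everyone else, capped at roughly 512 kB/s
    tc class add dev eth0 parent 1: classid 1:20 htb rate 512kbps

    # responses headed to the local subnet go into the fast class
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.168.0.0/16 flowid 1:10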
RE: Preventing bots from starving other users?
You can ask (polite) bots to throttle their request rates and simultaneous requests. I think you'd probably be quite interested in the Crawl-delay directive:
http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

This is respected by at least MSN and Yahoo. Unfortunately, it looks like Google may not (or may?) respect it; they propose this alternative:
http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Of course, if you're being scraped by a bot that doesn't respect this directive, or by a more malicious scraper, it won't help you at all.

-JohnF

-----Original Message-----
From: Wout Mertens [mailto:wout.mert...@gmail.com]
Sent: November 16, 2009 9:19 AM
To: John Lauro
Cc: haproxy@formilux.org
Subject: Re: Preventing bots from starving other users?

On Nov 16, 2009, at 2:43 PM, John Lauro wrote:

> Oops, my bad... it's actually tc and not iptables. Google "tc qdisc" for some info. You could let your local IPs go unrestricted and throttle all other IPs to 512 kB/sec, for example.

Hmmm... the problem isn't the data rate, it's the work associated with incoming requests. As soon as a 500-byte request hits, the web server has to do a lot of work.

> What software is the wiki running on? I assume it's not running under Apache, or there would be some ways to tune Apache. As others have mentioned, telling the crawlers to behave themselves, or to ignore the wiki entirely via a robots file, is probably best.

Well, the web server is Apache, but surprisingly Apache doesn't allow for tuning this particular case. Suppose normal request traffic looks like this (A = users):

  Time ->
  A A AA AA AAA AAA A

With the bot (B) this becomes:

  ABB A A BBA BA AABB

So you can see that normal users are just swamped out of slots. The web server can render about 9 pages at the same time without impact, but each page takes a second or more to render. At first I set MaxClients to 9, which keeps the web server from swapping itself to death, but if the bots have 8 requests queued up, and then another 8, and another 8, regular users have no chance of decent interactivity...

This may be a corner case due to slow serving, because I'm having a hard time finding a way to throttle the bots. I suppose that normally you'd just add servers...

Wout.
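[Editor's note: for completeness, the Crawl-delay directive mentioned above is just a line in robots.txt. Something like the following (the 10-second value is an arbitrary example) asks compliant crawlers such as Yahoo's and MSN's to wait between requests; as noted, Googlebot ignores it and has to be slowed down through Webmaster Tools instead:]

    User-agent: *
    Crawl-delay: 10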
Re: Preventing bots from starving other users?
Perhaps this plugin could be useful; never used it myself, though:
http://twiki.org/cgi-bin/view/Plugins.TWikiCacheAddOn

On Mon, Nov 16, 2009 at 11:46 AM, Wout Mertens <wout.mert...@gmail.com> wrote:

> On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:
>
>> Just create an additional backend and assign the bots to it. You can set queues and max connections there as needed.
>
> Yes, you're right - that's probably the best solution. I'll create an extra Apache process on the same server that will handle the bot subnet. No extra hardware needed. Thanks!
>
> The wiki in question is TWiki - very flexible, but very bad at caching what it does. Basically, for each page view the complete interpreter and all plugins get loaded.
>
> Wout.

--
Germán Gutiérrez
Infrastructure Team
OLX Inc.
Buenos Aires - Argentina
RE: Preventing bots from starving other users?
I would probably do that sort of throttling at the OS level with iptables, etc...

That said, before that I would investigate why the wiki is so slow... Something probably isn't configured right if it chokes with only a few simultaneous accesses. I mean, unless it's an embedded server with under 32MB of RAM, the hardware should be able to handle that...

-----Original Message-----
From: Wout Mertens [mailto:wout.mert...@gmail.com]
Sent: Sunday, November 15, 2009 9:57 AM
To: haproxy@formilux.org
Subject: Preventing bots from starving other users?

Hi there,

I was wondering if HAProxy helps in the following situation:

- We have a wiki site which is quite slow
- Regular users don't have many problems
- We also get crawled by a search bot, which creates many concurrent connections, more than the hardware can handle
- Therefore, service is degraded and users usually have their browsers time out on them

Given that we can't make the wiki faster, I was thinking that we could solve this by having a per-source-IP queue, which made sure that a given source IP cannot have more than e.g. 3 requests active at the same time. Requests beyond that would get queued. Is this possible?

Thanks,

Wout.
Re: Preventing bots from starving other users?
On Sun 15.11.2009 15:57, Wout Mertens wrote:

> Hi there,
>
> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could solve this by having a per-source-IP queue, which made sure that a given source IP cannot have more than e.g. 3 requests active at the same time. Requests beyond that would get queued. Is this possible?

Maybe with the "src <ip_address>" and "fe_sess_rate" criteria in the ACL section of
http://haproxy.1wt.eu/download/1.3/doc/configuration.txt

Maybe you'll also get some ideas from section "5) Access lists" in
http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt

HTH,
Aleks
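[Editor's note: a sketch of how fe_sess_rate could be applied here, patterned after the rate-limiting example in the configuration manual. The threshold, delay, port and backend are arbitrary example values, and note this is a frontend-wide brake rather than a true per-source-IP queue: when the whole frontend starts accepting sessions faster than the threshold, new connections are simply held for a moment before being processed:]

    frontend wiki
        bind :80
        mode http
        default_backend wiki_servers

        # session creation rate across the whole frontend
        acl too_fast fe_sess_rate ge 10

        # when above the threshold, hold new connections for up to 1s;
        # below it, they are accepted immediately
        tcp-request inspect-delay 1s
        tcp-request content accept if ! too_fast
        tcp-request content accept if WAIT_END

    backend wiki_servers
        mode http
        server wiki 127.0.0.1:8080 maxconn 9

For a genuine per-source limit (at most N concurrent requests per client, the rest queued), later HAProxy releases added stick tables with per-source tracking; on 1.3, the backend split plus per-server maxconn discussed earlier in the thread is probably the closest fit.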
Re: Preventing bots from starving other users?
2009/11/15 Wout Mertens <wout.mert...@gmail.com>:

> I was wondering if HAProxy helps in the following situation:
>
> - We have a wiki site which is quite slow
> - Regular users don't have many problems
> - We also get crawled by a search bot, which creates many concurrent connections, more than the hardware can handle
> - Therefore, service is degraded and users usually have their browsers time out on them
>
> Given that we can't make the wiki faster, I was thinking that we could solve this by having a per-source-IP queue, which made sure that a given source IP cannot have more than e.g. 3 requests active at the same time. Requests beyond that would get queued. Is this possible?

I guess so. I move traffic from crawlers to a special web backend, because they mostly harvest during my backup window and slow everything down even more. Adding a request limit should also be easy; just check the documentation.

--
Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)
Ul. Kruczkowskiego 12, 80-288 Gdańsk