RE: Problems with long connect times
> -Original Message-
> From: Willy Tarreau [mailto:w...@1wt.eu]
> Sent: Wednesday, October 14, 2009 12:38 PM
> To: Jonah Horowitz
> Cc: Hank A. Paulson; haproxy@formilux.org
> Subject: Re: Problems with long connect times
>
> Hi Jonah,
>
> On Wed, Oct 14, 2009 at 12:31:07AM -0700, Jonah Horowitz wrote:
> >
> > driver: tg3
> > version: 3.98
> > firmware-version: 5721-v3.55a
> > bus-info: :03:00.0
>
> OK this is fine.
>
> > Not running bnx2. Looks like it's not a 65536 limit either, I've been
> > graphing it and it's up to 80k sometimes, but it goes up and down.
>
> OK.
>
> > When it fails, it seems like it's either 3 seconds or 9 seconds. Would
> > TCP retransmits cause that?
>
> Yes, that's what I immediately observed on your graphs. Multiples of 3s
> are a typical consequence of TCP drops. Since the back-off algorithm is
> exponential, you have 3s, 6s, 12s, 24s ... between each retransmit. So
> having 3s and 9s implies that you sometimes lose one packet (3s) and
> sometimes two (3s+6s). The fact that you don't observe 6s implies that
> all packets are lost in the same direction.
>
> Also, generally such timers are only observable for the initial packets
> (SYN, SYN-ACK, ACK), because as soon as there is traffic, a drop is
> detected more quickly since the other end stops acking after a few
> packets.
>
> And retransmits on SYNs are most often caused by saturated session
> tables somewhere (the local nf_conntrack module, or any firewall between
> you and the other end). Oh, something else can happen. If you reach
> your servers through a PIX or FWSM firewall, or at least one that
> randomizes sequence numbers, the other server will not always be
> able to accept a new connection for a source port that it has in
> TIME_WAIT, because the initial sequence number will not be greater
> than the previous one due to the randomization. Then the server will
> return a pure ACK instead of a SYN-ACK, to which your haproxy machine
> will respond with an RST, then a SYN later upon retransmit.

The thing here is, there's no firewall or any device that should be doing
connection tracking between the haproxy node and the internet. We're using
some iptables rules, but nf_conntrack is disabled in the kernel configuration.

> The only way to detect this is to put a sniffer on both ends and
> compare sequence numbers. They must match. If not, you have such a
> nasty thing in the middle that needs to be fixed (for PIX and FWSM,
> there is an option I don't remember for that).
>
> > I just compiled a kernel with a default retransmit of 1sec, but I
> > haven't tested it yet.
> >
> > Here's the output of netstat -s:
> > Tcp:
> >     2059992268 active connections openings
> >     1933849278 passive connection openings
> >     4543998 failed connection attempts
> >     2093186 connection resets received
> >     142 connections established
> >     3547584716 segments received
> >     3643865881 segments send out
> >     20003371 segments retransmited
>
> This seems to be a lot. Almost 1% of retransmits!
>
> >     0 bad segments received.
> >     6179288 resets sent
>
> And this one could confirm the sequence number randomization hypothesis.
> > UdpLite:
> > TcpExt:
> >     4237091 resets received for embryonic SYN_RECV sockets
> >     1915476798 TCP sockets finished time wait in fast timer
> >     28901367 time wait sockets recycled by time stamp
> >     119887 packets rejects in established connections because of timestamp
> >     2171355337 delayed acks sent
> >     292818 delayed acks further delayed because of locked socket
> >     Quick ack mode was activated 697528 times
> >     15213 times the listen queue of a socket overflowed
> >     15213 SYNs to LISTEN sockets dropped
>
> That is not very good, you seem to have a slightly too small SYN
> backlog queue. Or maybe this only happens during manipulations?

How do I determine the size of my SYN backlog queue, and how do I increase it?

Thanks again,

Jonah
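For readers with the same question: on Linux the relevant knobs are
net.ipv4.tcp_max_syn_backlog (the half-open SYN queue) and net.core.somaxconn
(the cap on the accept/listen queue, which is what the overflow counters above
refer to). A minimal sketch of checking and raising them - the values are
illustrative only, not recommendations:

    # current values
    sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn

    # raise them (example numbers only); persist the change in /etc/sysctl.conf
    sysctl -w net.ipv4.tcp_max_syn_backlog=10240
    sysctl -w net.core.somaxconn=10240

The listening application has to be restarted before a larger accept queue
takes effect, since the backlog is fixed at listen() time; HAProxy also exposes
a per-frontend "backlog" parameter for the same purpose.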
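And for the sequence-number comparison Willy suggests earlier in the thread, a
capture limited to SYN/SYN-ACK packets on each end is usually enough. A sketch
only - the interface name and peer address below are placeholders:

    # run on both machines, then compare the ISNs for the same connections
    tcpdump -ni eth0 -w /tmp/syns.pcap 'host 192.0.2.10 and tcp[tcpflags] & tcp-syn != 0'

    # -S prints absolute (not relative) sequence numbers when reading back
    tcpdump -nr /tmp/syns.pcap -S

If a middlebox is rewriting ISNs, the sequence numbers for the same SYN will
differ between the two captures.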
Re: Problems with long connect times
driver: tg3
version: 3.98
firmware-version: 5721-v3.55a
bus-info: :03:00.0

Not running bnx2. Looks like it's not a 65536 limit either, I've been graphing
it and it's up to 80k sometimes, but it goes up and down.

When it fails, it seems like it's either 3 seconds or 9 seconds. Would TCP
retransmits cause that? I just compiled a kernel with a default retransmit of
1sec, but I haven't tested it yet.

Here's the output of netstat -s:

IcmpMsg:
    InType0: 18
    InType3: 50818
    InType8: 699
    OutType0: 699
    OutType3: 50841
    OutType8: 18
Tcp:
    2059992268 active connections openings
    1933849278 passive connection openings
    4543998 failed connection attempts
    2093186 connection resets received
    142 connections established
    3547584716 segments received
    3643865881 segments send out
    20003371 segments retransmited
    0 bad segments received.
    6179288 resets sent
UdpLite:
TcpExt:
    4237091 resets received for embryonic SYN_RECV sockets
    1915476798 TCP sockets finished time wait in fast timer
    28901367 time wait sockets recycled by time stamp
    119887 packets rejects in established connections because of timestamp
    2171355337 delayed acks sent
    292818 delayed acks further delayed because of locked socket
    Quick ack mode was activated 697528 times
    15213 times the listen queue of a socket overflowed
    15213 SYNs to LISTEN sockets dropped
    2125065 packets directly queued to recvmsg prequeue.
    18179 bytes directly in process context from backlog
    7564477 bytes directly received in process context from prequeue
    3465788360 packet headers predicted
    7232 packets header predicted and directly queued to user
    2567319929 acknowledgments not containing data payload received
    2718897 predicted acknowledgments
    80328 times recovered from packet loss by selective acknowledgements
    Detected reordering 3118 times using FACK
    Detected reordering 46 times using SACK
    Detected reordering 32513 times using time stamp
    55394 congestion windows fully recovered without slow start
    44249 congestion windows partially recovered using Hoe heuristic
    115 congestion windows recovered without slow start by DSACK
    101091 congestion windows recovered without slow start after partial ack
    4019 TCP data loss events
    TCPLostRetransmit: 17
    11 timeouts after reno fast retransmit
    443124 timeouts after SACK recovery
    266 timeouts in loss state
    83502 fast retransmits
    33980 forward retransmits
    8964 retransmits in slow start
    4227010 other TCP timeouts
    421 SACK retransmits failed
    698471 DSACKs sent for old packets
    118559 DSACKs received
    34 DSACKs for out of order packets received
    868905 connections reset due to unexpected data
    2054320 connections reset due to early user close
    1876779 connections aborted due to timeout
    TCPSACKDiscard: 1820
    TCPDSACKIgnoredOld: 110422
    TCPDSACKIgnoredNoUndo: 4762
    TCPSpuriousRTOs: 18
    TCPSackShifted: 9702
    TCPSackMerged: 59174
    TCPSackShiftFallback: 71815157
IpExt:
    InMcastPkts: 8816
    OutMcastPkts: 3589637
    InBcastPkts: 29338

Thanks again for all your help.

Jonah


On 10/13/09 9:37 PM, "Willy Tarreau" wrote:

> On Tue, Oct 13, 2009 at 12:52:55PM -0700, Jonah Horowitz wrote:
>> netstat -ant | grep tcp | tr -s ' ' ' ' | awk '{print $6}' | sort | uniq -c
>>      193 CLOSE_WAIT
>>      316 CLOSING
>>      215 ESTABLISHED
>>      252 FIN_WAIT1
>>        4 FIN_WAIT2
>>        1 LAST_ACK
>>       10 LISTEN
>>      237 SYN_RECV
>>    61384 TIME_WAIT
>>
>> So, clearly there's a time_wait problem. I've already tuned the kernel
>> to set the time_wait counter to 20 seconds (down from 60). I'm tempted
>> to crank it down further, although googling around recommends against
>> it.
>> Is it possible to up the number of outstanding time_wait connections?
>> This host looks like it's hitting a 65536 connection limit.
>
> No, TIME_WAIT are not an issue, and are even normal. It's useless to
> try to reduce them, your proxy can simply re-use them. The only case
> where it is not possible is when the proxy closed the connection first
> (eg: "option forceclose"), but your config does not have this.
>
> I'm more concerned by the SYN_RECV, which indicate that you did not
> get an ACK from a client. I'm suspecting you have a high packet loss
> rate. What type of NIC are you running? Wouldn't this be a bnx2 with
> firmware 1.9.6? (use "ethtool -i eth0"). If so, you must find a
> firmware on your vendor's site and upgrade it, as this one is very
> common and very buggy.
>
> Regards,
> Willy

--
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107
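As an aside, the retransmit ratio discussed earlier in this thread can be
computed directly from the counters above with a throwaway one-liner; the field
names, including the kernel's own spelling of "retransmited", are taken
verbatim from the netstat output:

    netstat -s | awk '/segments send out/     {sent=$1}
                      /segments retransmited/ {re=$1}
                      END {printf "%.2f%% of sent segments retransmitted\n", 100*re/sent}'

With the counters above this comes out at roughly 0.5%.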
RE: Problems with long connect times
netstat -ant | grep tcp | tr -s ' ' ' ' | awk '{print $6}' | sort | uniq -c
      193 CLOSE_WAIT
      316 CLOSING
      215 ESTABLISHED
      252 FIN_WAIT1
        4 FIN_WAIT2
        1 LAST_ACK
       10 LISTEN
      237 SYN_RECV
    61384 TIME_WAIT

So, clearly there's a time_wait problem. I've already tuned the kernel to set
the time_wait counter to 20 seconds (down from 60). I'm tempted to crank it
down further, although googling around recommends against it. Is it possible
to up the number of outstanding time_wait connections? This host looks like
it's hitting a 65536 connection limit.

> -Original Message-
> From: Hank A. Paulson [mailto:h...@spamproof.nospammail.net]
> Sent: Monday, October 12, 2009 9:14 PM
> To: haproxy@formilux.org
> Subject: Re: Problems with long connect times
>
> A couple of guesses you might look at -
> I have found the stats page to show deceptively low numbers at times.
> You might want to check the HTTP log stats that show the
> global/frontend/backend queue numbers around the time of those requests.
> My guess is that in the cases where you are seeing 3 second times, the
> backends are slow to connect or they have reached maxconn. Also, you
> might want to double check that the clients are sending the requests in
> a timely fashion.
>
> netstat -ant | wc -l
>
> Do you have conntrack running, as in the recent situation here on the ml?
> Any other messages in /var/log/messages?
> Does netstat -s have any growing stats?
>
> I assume you have lots of backends if they are all at only maxconn 20
>
>
> On 10/12/09 5:15 PM, Jonah Horowitz wrote:
> > I'm having a problem where occasionally under load, the time to
> > complete the tcp handshake is taking much longer than it should:
> >
> > Picture (Device Independent Bitmap)
> >
> > My suspicion is that the number of connections available to the
> > haproxy server are somehow constrained and it can't answer connections
> > for a moment. I'm not sure how to debug this. Has anyone else seen
> > something like this?
> >
> > According to the haproxy stats page, I've never come close to my
> > connection limit. I'm using about 1000 concurrent connections and my
> > request rate maxes out at 4400 requests per second. I'm not seeing any
> > messages in dmesg or my /var/log/messages.
> >
> > I'm running 1.4-dev3 on Linux 2.6.30.5.
> > My config is below:
> >
> > TIA,
> >
> > Jonah
> >
> > --- compile options ---
> >
> > make USE_REGPARM=1 USE_STATIC_PCRE=1 USE_LINUX_SPLICE=1 TARGET=linux26 CPU_CFLAGS='-O2 -march=x86-64 -m64'
> >
> > --- config ---
> >
> > global
> >     maxconn 2000
> >     pidfile /usr/pkg/haproxy/run/haproxy.pid
> >     stats socket /usr/pkg/haproxy/run/stats
> >     log /usr/pkg/haproxy/jail/log daemon
> >     user daemon
> >     group daemon
> >
> > defaults
> >     timeout queue 3000
> >     timeout server 3000
> >     timeout client 3000
> >     timeout connect 3000
> >     option splice-auto
> >
> > frontend stats
> >     bind :8080
> >     mode http
> >     use_backend stats if TRUE
> >
> > backend stats
> >     mode http
> >     stats enable
> >     stats uri /stats
> >     stats refresh 5s
> >
> > frontend query
> >     log global
> >     option dontlog-normal
> >     option httplog
> >     bind :80
> >     mode http
> >     use_backend query if TRUE
> >
> > backend query
> >     mode http
> >     balance roundrobin
> >     option httpchk GET /r?q=LOOKSMARTKEYWORDLISTINGMONITOR&isp=DROPus
> >     option forwardfor
> >     option httpclose
> >     server foo1 foo1:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >     server foo2 foo2:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >     server foo3 foo3:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >     ...
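Hank's suggestion about queue numbers can also be checked live against the
stats socket declared in this config. A sketch, assuming socat is installed
and that this HAProxy build accepts "show stat" on the UNIX socket (the socket
path comes from the global section above; the first CSV line is a header
naming each column, including qcur/qmax for queues and scur/smax for sessions):

    echo "show stat" | socat unix-connect:/usr/pkg/haproxy/run/stats stdio

Sampling this in a loop during the slow-connect episodes shows whether the
per-server maxconn 20 is filling up and requests are sitting in the queue.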
RE: Kernel tuning recommendations
I ended up just building a kernel without conntrack, module or otherwise. I'm
sure you could prevent conntrack from loading somehow, but this was easier
from my perspective.

Jonah

> -Original Message-
> From: Michael Marano [mailto:mmar...@futureus.com]
> Sent: Wednesday, October 07, 2009 3:03 PM
> To: ch...@sargy.co.uk
> Cc: haproxy@formilux.org; Mark Kramer
> Subject: Re: Kernel tuning recommendations
>
> I've made a handful of changes based upon Chris and Willy's suggestions,
> which I've included below. This avoids the nf_conntrack errors in the logs.
>
> I would like to skip nf_conntrack altogether. I've been digging around to
> try to learn how to do that, but I now admit I don't know how. I can't just
> drop the module, as it's currently in use.
>
> [mmar...@w1 w1]$ sudo modprobe -n -r nf_conntrack
> FATAL: Module nf_conntrack is in use.
>
> What do I need to change in my iptables rules to pave the way for removing
> this module? Once I've got that straight, how do I then disable the module?
> I'm happy to get an RTFM response if I'm just being stupid. Point me at the
> right M ;)
>
> Michael Marano
>
>
> --- iptables rules script ---
> #!/bin/sh
>
> sudo /sbin/iptables -F
> sudo /sbin/iptables -A INPUT -i lo -j ACCEPT
> sudo /sbin/iptables -A INPUT -i ! lo -d 127.0.0.0/8 -j REJECT
> sudo /sbin/iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> sudo /sbin/iptables -A OUTPUT -j ACCEPT
>
> # don't track incoming or outgoing port 80
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 81 -j NOTRACK
>
> # don't track traffic starting from the private ip
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp -s 10.176.45.165 -j NOTRACK
>
> # these may not actually be useful, but I'm leaving them in.
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 8080 -j NOTRACK
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 81 -j NOTRACK
>
> sudo /sbin/iptables -A INPUT -p tcp -m state --state NEW --dport 22 -j ACCEPT
> sudo /sbin/iptables -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
> sudo /sbin/iptables -A INPUT -j REJECT
> sudo /sbin/iptables -A FORWARD -j REJECT
> --- iptables rules script ---
>
>
> --- additions to sysctl.conf ---
> #
> # TCP tuning
> #
> # from http://agiletesting.blogspot.com/2009/03/haproxy-and-apache-performance-tuning.html
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.ip_local_port_range = 1024 65023
> net.ipv4.tcp_max_syn_backlog = 10240
> net.ipv4.tcp_max_tw_buckets = 40
> net.ipv4.tcp_max_orphans = 6
> net.ipv4.tcp_synack_retries = 3
> net.core.somaxconn = 4
>
> # from http://serverfault.com/questions/11106/best-linux-network-tuning-tips
> net.ipv4.route.max_size = 262144
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 18000
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
> net.netfilter.nf_conntrack_max = 128000
> net.netfilter.nf_conntrack_expect_max = 4096
>
> # additions based on questions to the haproxy mailing list
> # http://www.mail-archive.com/haproxy@formilux.org/msg01321.html
> net.ipv4.tcp_timestamps = 1
> net.core.netdev_max_backlog = 4
> # these were all lower than the default values already set, so I left them out
> #net.ipv4.tcp_rmem = 4096 8192 16384
> #net.ipv4.tcp_wmem = 4096 8192 16384
> #net.ipv4.tcp_mem = 65536 98304 131072
> --- additions to sysctl.conf ---
>
>
>
> > From:
> > Date: Wed, 07 Oct 2009 11:24:23 +0100
> > To: Michael Marano
> > Cc:
> > Subject: Re: Kernel tuning recommendations
> >
> > Here are the adjusted IPv4 settings I use on my haproxy box - I picked
> > these up from around the web, and they seem to work for me, not that
> > they are in use on a particularly high volume site currently.
> >
> > Chris
> >
> > net.ipv4.tcp_tw_reuse = 1
> > net.ipv4.ip_local_port_range = 1024 65023
> > net.ipv4.tcp_max_syn_backlog = 10240
> > net.ipv4.tcp_max_tw_buckets = 40
> > net.ipv4.tcp_max_orphans = 6
> > net.ipv4.tcp_synack_retries = 3
> > net.ipv4.tcp_max_syn_backlog = 45000
> > net.ipv4.tcp_timestamps = 1
> > net.ipv4.tcp_rmem = 4096 8192 16384
> > net.ipv4.tcp_wmem = 4096 8192 16384
> > net.ipv4.tcp_mem = 65536 98304 131072
> > net.core.somaxconn = 4
> > net.core.netdev_max_backlog = 4
> >
> >
> >
> > Quoting Michael Marano:
> >
> >> Subsequent load tests proved me wrong. I'm still getting the
> >> nf_conntrack messages. Perhaps I've misconfigured my iptables rules?
> >>
> >>
> >> # bits of /var/log/messages
> >>
> >> Oct 6 21:58:40 w1 kernel: [3718555.091684] printk: 2 messages suppressed.
> >> Oct 6 21:58:40 w1 kernel: [3718555.091705] nf_connt
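For anyone who would rather not rebuild the kernel as Jonah did, the usual
alternative is to stop the module from being (re)loaded. The file name under
/etc/modprobe.d and the module list below are assumptions that vary by
distribution and kernel version, so treat this purely as a sketch:

    # is conntrack loaded, and how full is its table?
    lsmod | grep conntrack
    cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

    # the "-m state" rules are what keep nf_conntrack pinned; once they are
    # gone (or replaced by NOTRACK coverage), the module can be removed and
    # blocked from auto-loading the next time iptables asks for it
    rmmod nf_conntrack_ipv4 nf_conntrack
    echo "install nf_conntrack /bin/false" > /etc/modprobe.d/no-conntrack.conf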
RE: Nbproc question
Here's the output of top on the system:

top - 09:50:36 up 4 days, 18:50, 1 user, load average: 1.31, 1.59, 1.55
Tasks: 117 total, 2 running, 115 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.5%us, 9.9%sy, 0.0%ni, 75.0%id, 0.0%wa, 0.5%hi, 12.1%si, 0.0%st
Mem: 8179536k total, 997748k used, 7181788k free, 139236k buffers
Swap: 9976356k total, 0k used, 9976356k free, 460396k cached

   PID USER   PR NI  VIRT  RES SHR S %CPU %MEM     TIME+ COMMAND
752741 daemon 20  0 34760  24m 860 R  100  0.3 871:15.76 haproxy

It's a quad core system, but haproxy is taking 100% of one core. We're doing
less than 5k req/sec and the box has two 2.6GHz Opterons in it.

Do you know how much health checks affect CPU utilization of an haproxy
process? We have about 100 backend servers and we're running
"inter 500 rise 2 fall 1". I haven't tried adjusting that, although when it
was set to the default our error rates were much higher.

Thanks,

Jonah

-Original Message-
From: Willy Tarreau [mailto:w...@1wt.eu]
Sent: Monday, September 28, 2009 9:50 PM
To: Jonah Horowitz
Cc: haproxy@formilux.org
Subject: Re: Nbproc question

On Mon, Sep 28, 2009 at 06:43:58PM -0700, Jonah Horowitz wrote:
> In the documentation it seems to discourage using the nbproc directive.
> What's the situation with this? I'm running a server with 8 cores, so I'm
> tempted to up the nbproc. Is the process normally multithreaded?

No, the process is not multithreaded.

> Is nbproc something I can use for performance tuning, or is it just for
> file handles?

It can bring you small performance gains at the expense of more complex
monitoring, since the stats will still only reflect the process which
receives the stats request. Also, health checks will be performed by each
process, causing an increased load on your servers. And the connection
limitation will not work anymore, as any process won't know that there are
other processes already using a server.

It was initially designed to work around per-process file handle limitations
on some systems, but it is true that it brings a minor performance advantage.
However, considering that you can reach 4 connections per second with a
single process on a cheap core2duo 2.66 GHz, and that forwarding data at
10 Gbps on this machine consumes only 20% of a core, you can certainly
understand why I don't see the situations where it would make sense to use
nbproc.

Regards,
Willy
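On the health-check question: with the figures quoted above, the check traffic
alone is easy to estimate, and the interval is the main lever. A sketch of the
arithmetic and of a gentler server line - the values are illustrative only,
and as Willy notes above, nbproc would multiply this load by the number of
processes:

    # rough check load with the settings quoted above:
    #   100 servers x (1 check / 0.5s) = ~200 HTTP health checks per second,
    #   all issued by the single haproxy process
    #
    # a gentler variant (example values only):
    server foo1 foo1:8080 weight 150 maxconn 20 check inter 2000 rise 2 fall 1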
Nbproc question
In the documentation it seems to discourage using the nbproc directive.
What's the situation with this? I'm running a server with 8 cores, so I'm
tempted to up the nbproc. Is the process normally multithreaded? Is nbproc
something I can use for performance tuning, or is it just for file handles?

--
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107
RE: artificial maxconn imposed
I fixed the nf_conntrack problem with this (really just the first one, but the
others were good too).

HAProxy sysctl changes - for network tuning, add the following to
/etc/sysctl.conf:

net.ipv4.netfilter.ip_conntrack_max = 16777216
net.ipv4.tcp_max_tw_buckets = 16777216

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
# set max to at least 4MB, or higher if you use very high BDP paths
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

-jonah

-Original Message-
From: David Birdsong [mailto:david.birds...@gmail.com]
Sent: Friday, September 18, 2009 3:06 PM
To: haproxy
Subject: artificial maxconn imposed

I've set ulimit -n 2
maxconn in defaults is 16384

and still somehow when i check the stats page, maxconn is limited to 1, and
sure enough requests start piling up.

Any suggestions on where else to look? I'm sure it's an OS thing, so:
Fedora 10 x86_64
16GB of RAM

This command doesn't turn anything up:
find /proc/sys/net/ipv4 -type f -exec cat {} \; | grep 1

(Also dmesg shows nf_conntrack: table full, dropping packet.) which I think is
another problem. Might be time to switch to a *BSD.
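On the maxconn ceiling itself: a couple of quick checks usually show which
limit the process actually picked up. This is only a sketch, assuming a procfs
with /proc/<pid>/limits and that haproxy was started outside the shell where
ulimit was changed; keep in mind that the process-wide cap is maxconn in the
global section (the defaults section only seeds the per-frontend value), and
that haproxy normally computes its own file-descriptor limit from it:

    # what fd limit did the running process really get?
    grep -i 'open files' /proc/$(pgrep -o haproxy)/limits

    # process-wide cap lives in the global section (example values):
    global
        maxconn  16384
        # ulimit-n 33000   # optional override; haproxy derives this by default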
Backend Server UP/Down Debugging?
I'm watching my servers on the back end and occasionally they flap. I'm
wondering if there is a way to see why they are taken out of service. I'd
like to see the actual response, or at least an HTTP status code.

Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
w: 415.348.7694 · c: 415.513.7202 · f: 415.348.7020
625 Second Street, San Francisco, CA 94107
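Until the check status is visible somewhere, one low-tech option is to replay
the configured httpchk request yourself and log the status codes while the
flapping happens. The URI below is the one from the config posted earlier in
this archive, and the loop is only a sketch:

    while sleep 1; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" \
            "$(curl -s -o /dev/null -w '%{http_code}' \
                'http://foo1:8080/r?q=LOOKSMARTKEYWORDLISTINGMONITOR&isp=DROPus')"
    done | tee /tmp/foo1-checks.log

HAProxy itself logs a "Server <backend>/<server> is DOWN" line to the
configured log target when a server is marked down; depending on the version,
that line may or may not include the failing check status.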
Re: realtime switch to another backend if got 5xx error?
I'm trying to figure out how this works. I desperately need to figure out a
way to monitor servers and either take any server that sends any 5xx error out
of rotation, or failing that, at least redirect the query to a different
server. The clients that use this web service are SOAP/XML clients, so they're
not "real" web browsers. Also, we don't use any cookies. It looks like this
config just tells the client to make a second request. Am I missing something
here?

I know I can use httpchk, but I don't want to run "inter 1" because then all
my traffic is monitoring traffic. Each server is normally doing several
hundred requests per second, and our haproxy test setup is orders of magnitude
higher on the % of 500 errors (10% vs .01%).

Any ideas?

Thanks,

Jonah

On 6/11/09 7:45 AM, "Maciej Bogucki" wrote:

> Dawid Sieradzki / Gadu-Gadu S.A. writes:
>> Hi.
>>
>> The problem is how to silently switch to another backend in realtime if
>> we get a 500 answer from a backend, without the http_client's knowledge.
>> Yes, I know, httpchk, but the 500 errors are about 10 per hour and we
>> don't know when or why. So it is a race over who gets the 500 first -
>> httpchk or the http_client.
>>
>> If you don't know what I mean, here is an example config:
>>
>> ---8<---
>>
>> frontend
>>     (..)
>>     default_backend back_1
>>
>> backend back_1
>>     option httpchk GET /index.php HTTP/1.1\r\nHost:\ test.pl
>>     mode http
>>     retries 10
>>     balance roundrobin
>>     server chk1 127.0.0.1:81 weight 1 check
>>     server chk2 127.0.0.1:82 weight 1 check
>>     server chk3 127.0.0.1:83 weight 1 check backup
>>
>> --->8---
>>
>> http_client -> haproxy -> (backend1|backend2|backend3)
>>
>> Let's go inside a request:
>>
>> A. haproxy received a request from http_client
>> B. haproxy sent the request from http_client to backend1
>> C. backend1 said 500 internal server error
>>
>> I want: :-)
>> D. haproxy sends the request to backend2 (or a backup backend, or
>>    another one, or one more time to backend1)
>>
>> I have: :-(
>> D. haproxy sent the 500 internal server error from backend1 to http_client
>> E. haproxy will mark backend1 as down if it got 2 or more error-500
>>    responses from backend1
>>
>> Is it possible to do that?
>>
> Hello,
>
> Yes, it is possible, but it could be dangerous for some kinds of
> applications, e.g. a billing system ;) Here is an example of how to do it.
> I know that it is a hack, but it works well ;P
>
> frontend fr1
>     default_backend back_1
>     rspirep ^HTTP/...\ [23]0..* \0\nSet-Cookie:\ cookiexxx=0;path=/;domain=.yourdomain.com
>     rspirep ^(HTTP/...)\ 5[0-9][0-9].* \1\ 202\ Again\ Please\nSet-Cookie:\ cookiexxx=1;path=/;domain=.yourdomain.com\nRefresh:\ 6\nContent-Length:\ Length_xxx\nContent-Type:\ text/html\n\n src="http://www.yourdomain.com/redispatch.pl";>
>
> backend back_1
>     cookie cookiexxx
>     server chk1 127.0.0.1:81 weight 1 check
>     server chk2 127.0.0.1:82 weight 1 check
>     server chk3 127.0.0.1:83 weight 1 check cookie 1 backup
>
> Remember to set Length_xxx properly.
>
> Best Regards
> Maciej Bogucki

--
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107
Re: HAProxy - Inline Monitoring?
Willy,

I can see why, with some web farms, you wouldn't want to take servers out of
rotation after just one, or a few, 5xx errors, particularly since they are
often caused by bad user input. In our case, any 5xx error is almost always an
indication that the server in question is in a bad state. Particularly
problematic is that a server serving 5xx errors tends to do so much faster
than one responding to legitimate requests. This means that a bad server can
serve several thousand requests before the next health check kicks it out of
service.

Implementing inline monitoring dropped our 5xx error rate by two orders of
magnitude, so it is pretty important for us. If we move forward, we'll likely
submit a patch if the functionality doesn't exist as things stand now. Perhaps
it would be better if it was a counter that took a server out after a set
number of consecutive failed requests.

Jonah

On 5/24/09 9:56 PM, "Willy Tarreau" wrote:

> Hi,
>
> On Fri, May 22, 2009 at 11:37:14AM -0700, Jonah Horowitz wrote:
>> I'm currently testing HAProxy for deployment. Right now we use NetScaler
>> load balancers, and they provide a feature called "inline monitoring".
>> With inline monitoring the NetScaler will take a server out of rotation
>> if it responds with a 5xx error to a client request. It does this
>> separately from standard health checks. Is there a way to do this with
>> HAProxy?
>
> No, and I don't want to do the same as it seems a little bit risky to me.
> However, what is planned is to switch to fast health checks when a number
> of 5xx errors is encountered. That way, it would significantly reduce the
> time to detect a server failure without the risk of taking a server out of
> the farm on random errors.
>
> Regards,
> Willy

--
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107
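For readers landing on this thread later: the behaviour Willy describes as
planned did appear in the 1.4 series as per-server traffic observation. A
minimal sketch, assuming a 1.4-or-newer binary (these keywords are not in 1.3,
and the thresholds here are illustrative, not recommendations):

    backend query
        mode http
        option httpchk GET /r?q=LOOKSMARTKEYWORDLISTINGMONITOR&isp=DROPus
        # count live 5xx responses as errors; after 10 of them, shrink the
        # check interval so a regular health check confirms the failure fast
        server foo1 foo1:8080 check inter 2000 fastinter 500 observe layer7 error-limit 10 on-error fastinter

This keeps the explicit health check as the final arbiter, which is exactly
the compromise Willy describes: live 5xx responses only accelerate detection
rather than removing the server outright.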
HAProxy - Inline Monitoring?
I'm currently testing HAProxy for deployment. Right now we use NetScaler load
balancers, and they provide a feature called "inline monitoring". With inline
monitoring the NetScaler will take a server out of rotation if it responds
with a 5xx error to a client request. It does this separately from standard
health checks. Is there a way to do this with HAProxy?

--
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107