RE: Problems with long connect times

2009-10-14 Thread Jonah Horowitz


> -Original Message-
> From: Willy Tarreau [mailto:w...@1wt.eu]
> Sent: Wednesday, October 14, 2009 12:38 PM
> To: Jonah Horowitz
> Cc: Hank A. Paulson; haproxy@formilux.org
> Subject: Re: Problems with long connect times
> 
> Hi Jonah,
> 
> On Wed, Oct 14, 2009 at 12:31:07AM -0700, Jonah Horowitz wrote:
> >
> > driver: tg3
> > version: 3.98
> > firmware-version: 5721-v3.55a
> > bus-info: :03:00.0
> 
> OK this is fine.
> 
> > Not running bnx2.  Looks like it's not a 65536 limit either, I've been
> > graphing it and it's up to 80k sometimes, but it goes up and down.
> 
> OK.
> 
> > When it fails, it seems like it's either 3 seconds or 9 seconds.  Would tcp
> > retransmits cause that?
> 
> yes, that's what I immediately observed on your graphs. Multiples of 3s
> are a typical consequence of TCP drops. Since the back-off algorithm is
> exponential, you have 3s, 6s, 12s, 24s ... between each retransmit. So
> having 3s and 9s implies that you sometimes lose one packet (3s) and
> sometimes two (3s+6s). The fact that you don't observe 6s implies that
> all packets are lost in the same direction.
> 
> Also, generally such timers are only observable for initial packets
> (SYN, SYN-ACK, ACK) because as soon as there is traffic, a drop is
> more quickly detected because the other end does not ack it after
> several packets.
> 
> And retransmits on SYNs are most often caused by saturated session
> tables somewhere (local nf_conntrack module, or any firewall between
> you and the other place). Oh, something else can happen. If you reach
> your servers through a PIX or FWSM firewall, or at least one that
> randomizes sequence numbers, the other server will not always be
> able to accept a new connection for a source port that it has in
> TIME_WAIT, because the initial sequence number will not be greater
> than the previous one due to the randomization. Then the server will
> return a pure ACK instead of a SYN-ACK, to which your haproxy machine
> will respond with an RST, then a SYN later upon retransmit.

The thing here is, there's no firewall or any device that should be
doing connection tracking between the haproxy node and the internet.
We're using some iptables rules, but nf_conntrack is disabled in the
kernel configuration.
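
A quick way to double-check that, assuming lsmod is available and a stock
/proc layout:

  lsmod | grep -i conntrack               # should print nothing if conntrack is compiled out
  ls /proc/net/nf_conntrack 2>/dev/null   # this file only exists while conntrack is active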

> 
> The only way to detect this is to put a sniffer on both ends and
> compare sequence numbers. They must match. If not, you have such a
> nasty thing in the middle that needs to be fixed (for PIX and FWSM,
> there is an option I don't remember for that).
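
A minimal way to make that comparison, assuming eth0 on both machines and
HTTP on port 80 (interface and port are illustrative):

  # run on the haproxy box and on the server; -S prints absolute sequence numbers
  tcpdump -ni eth0 -S 'tcp port 80 and tcp[tcpflags] & tcp-syn != 0'

If the initial sequence numbers of the same connection differ between the
two captures, something in the middle is rewriting them.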
> 
> > I just compiled a kernel with a default retransmit
> > of 1sec, but I haven't tested it yet.
> >
> > Here's the output of netstat -s:
> > Tcp:
> > 2059992268 active connections openings
> > 1933849278 passive connection openings
> > 4543998 failed connection attempts
> > 2093186 connection resets received
> > 142 connections established
> > 3547584716 segments received
> > 3643865881 segments send out
> > 20003371 segments retransmited
> 
> This seems to be a lot. Almost 1% of retransmits !
> 
> > 0 bad segments received.
> > 6179288 resets sent
> 
> And this one could confirm the sequence number randomization
> hypothesis.
> 
> 
> > UdpLite:
> > TcpExt:
> > 4237091 resets received for embryonic SYN_RECV sockets
> > 1915476798 TCP sockets finished time wait in fast timer
> > 28901367 time wait sockets recycled by time stamp
> > 119887 packets rejects in established connections because of timestamp
> > 2171355337 delayed acks sent
> > 292818 delayed acks further delayed because of locked socket
> > Quick ack mode was activated 697528 times
> > 15213 times the listen queue of a socket overflowed
> > 15213 SYNs to LISTEN sockets dropped
> 
> That is not very good, you seem to have a slightly too small SYN
> backlog queue. Or maybe this only happens during manipulations ?

How do I determine the size of my SYN backlog queue, and how do I
increase it?
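
For reference, the relevant knobs on a 2.6 kernel are the tcp_max_syn_backlog
sysctl and the listen() backlog, which is capped by net.core.somaxconn
(a sketch; values are illustrative):

  sysctl net.ipv4.tcp_max_syn_backlog net.core.somaxconn   # current values
  sysctl -w net.ipv4.tcp_max_syn_backlog=10240
  sysctl -w net.core.somaxconn=10240

haproxy also accepts a per-frontend "backlog" parameter if the default it
derives from maxconn turns out to be too small.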

Thanks again,

Jonah



Re: Problems with long connect times

2009-10-14 Thread Jonah Horowitz

driver: tg3
version: 3.98
firmware-version: 5721-v3.55a
bus-info: :03:00.0

Not running bnx2.  Looks like it's not a 65536 limit either, I've been
graphing it and it's up to 80k sometimes, but it goes up and down.

When it fails, it seems like it's either 3 seconds or 9 seconds.  Would tcp
retransmits cause that?  I just compiled a kernel with a default retransmit
of 1sec, but I haven't tested it yet.

Here's the output of netstat -s:

IcmpMsg:
InType0: 18
InType3: 50818
InType8: 699
OutType0: 699
OutType3: 50841
OutType8: 18
Tcp:
2059992268 active connections openings
1933849278 passive connection openings
4543998 failed connection attempts
2093186 connection resets received
142 connections established
3547584716 segments received
3643865881 segments send out
20003371 segments retransmited
0 bad segments received.
6179288 resets sent
UdpLite:
TcpExt:
4237091 resets received for embryonic SYN_RECV sockets
1915476798 TCP sockets finished time wait in fast timer
28901367 time wait sockets recycled by time stamp
119887 packets rejects in established connections because of timestamp
2171355337 delayed acks sent
292818 delayed acks further delayed because of locked socket
Quick ack mode was activated 697528 times
15213 times the listen queue of a socket overflowed
15213 SYNs to LISTEN sockets dropped
2125065 packets directly queued to recvmsg prequeue.
18179 bytes directly in process context from backlog
7564477 bytes directly received in process context from prequeue
3465788360 packet headers predicted
7232 packets header predicted and directly queued to user
2567319929 acknowledgments not containing data payload received
2718897 predicted acknowledgments
80328 times recovered from packet loss by selective acknowledgements
Detected reordering 3118 times using FACK
Detected reordering 46 times using SACK
Detected reordering 32513 times using time stamp
55394 congestion windows fully recovered without slow start
44249 congestion windows partially recovered using Hoe heuristic
115 congestion windows recovered without slow start by DSACK
101091 congestion windows recovered without slow start after partial ack
4019 TCP data loss events
TCPLostRetransmit: 17
11 timeouts after reno fast retransmit
443124 timeouts after SACK recovery
266 timeouts in loss state
83502 fast retransmits
33980 forward retransmits
8964 retransmits in slow start
4227010 other TCP timeouts
421 SACK retransmits failed
698471 DSACKs sent for old packets
118559 DSACKs received
34 DSACKs for out of order packets received
868905 connections reset due to unexpected data
2054320 connections reset due to early user close
1876779 connections aborted due to timeout
TCPSACKDiscard: 1820
TCPDSACKIgnoredOld: 110422
TCPDSACKIgnoredNoUndo: 4762
TCPSpuriousRTOs: 18
TCPSackShifted: 9702
TCPSackMerged: 59174
TCPSackShiftFallback: 71815157
IpExt:
InMcastPkts: 8816
OutMcastPkts: 3589637
InBcastPkts: 29338


Thanks again for all your help.

Jonah


On 10/13/09 9:37 PM, "Willy Tarreau"  wrote:

> On Tue, Oct 13, 2009 at 12:52:55PM -0700, Jonah Horowitz wrote:
>> netstat -ant | grep tcp | tr -s ' ' ' ' | awk '{print $6}' | sort | uniq
>> -c
>>    193 CLOSE_WAIT
>>    316 CLOSING
>>    215 ESTABLISHED
>>    252 FIN_WAIT1
>>      4 FIN_WAIT2
>>      1 LAST_ACK
>>     10 LISTEN
>>    237 SYN_RECV
>>  61384 TIME_WAIT
>> 
>> So, clearly there's a time_wait problem.  I've already tuned the kernel
>> to set the time_wait counter to 20 seconds (down from 60).  I'm tempted
>> to crank it down further, although googling around recommends against
>> it.  Is it possible to up the number of outstanding time_wait
>> connections?  This host looks like it's hitting a 65536 connection
>> limit.
> 
> No, TIME_WAIT are not an issue, and are even normal. It's useless to
> try to reduce them, your proxy can simply re-use them. The only case
> where it is not possible is when the proxy closed the connection first
> (eg: "option forceclose") but your config does not have this.
> 
> I'm more concerned by the SYN_RECV which indicate that you did not
> get an ACK from a client. I'm suspecting you have a high packet loss
> rate. What type of NIC are you running from ? Wouldn't this be a
> bnx2 with firmware 1.9.6 ? (use "ethtool -i eth0"). If so, you must
> find a firmware on your vendor's site and upgrade it, as this one
> is very common and very buggy.
> 
> Regards,
> Willy
> 

-- 
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107





RE: Problems with long connect times

2009-10-13 Thread Jonah Horowitz
netstat -ant | grep tcp | tr -s ' ' ' ' | awk '{print $6}' | sort | uniq
-c
    193 CLOSE_WAIT
    316 CLOSING
    215 ESTABLISHED
    252 FIN_WAIT1
      4 FIN_WAIT2
      1 LAST_ACK
     10 LISTEN
    237 SYN_RECV
  61384 TIME_WAIT

So, clearly there's a time_wait problem.  I've already tuned the kernel
to set the time_wait counter to 20 seconds (down from 60).  I'm tempted
to crank it down further, although googling around recommends against
it.  Is it possible to up the number of outstanding time_wait
connections?  This host looks like it's hitting a 65536 connection
limit.
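
For reference, the limits that usually matter here on a 2.6 kernel
(a sketch; values are illustrative):

  # how many TIME_WAIT sockets the kernel keeps, and the ephemeral port range
  sysctl net.ipv4.tcp_max_tw_buckets net.ipv4.ip_local_port_range

  # let outgoing connections reuse sockets in TIME_WAIT (requires timestamps)
  sysctl -w net.ipv4.tcp_timestamps=1
  sysctl -w net.ipv4.tcp_tw_reuse=1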



> -Original Message-
> From: Hank A. Paulson [mailto:h...@spamproof.nospammail.net]
> Sent: Monday, October 12, 2009 9:14 PM
> To: haproxy@formilux.org
> Subject: Re: Problems with long connect times
> 
> A couple of guesses you might look at -
> I have found the stats page to show deceptively low numbers at times.
> You might want to check the http log stats that show the
> global/frontend/backend queue numbers around the time of those requests.
> My guess is that in the cases where you are seeing 3 second times, the
> backends are slow to connect or they have reached maxconn. Also, you
> might want to double check that the clients are sending the requests
> in a timely fashion.
> 
> netstat -ant | wc -l
> 
> do you have conntrack running as in the recent situation here on the ml?
> Any other messages in /var/log/messages?
> netstat -s have any growing stats?
> 
> I assume you have lots of backends if they are all at only maxconn 20
> 
> 
> On 10/12/09 5:15 PM, Jonah Horowitz wrote:
> > I'm having a problem where occasionally under load, the time to complete
> > the tcp handshake is taking much longer than it should:
> >
> > [attached picture: graph of tcp handshake completion times]
> >
> > My suspicion is that the number of connections available to the haproxy
> > server is somehow constrained and it can't answer connections for a
> > moment. I'm not sure how to debug this. Has anyone else seen something
> > like this?
> >
> > According to the haproxy stats page, I've never come close to my
> > connection limit. I'm using about 1000 concurrent connections and my
> > request rate maxes out at 4400 requests per second. I'm not seeing any
> > messages in dmesg or my /var/log/messages.
> >
> > I'm running 1.4-dev3 on Linux 2.6.30.5. My config is below:
> >
> > TIA,
> >
> > Jonah
> >
> > --- compile options ---
> >
> > make USE_REGPARM=1 USE_STATIC_PCRE=1 USE_LINUX_SPLICE=1 TARGET=linux26 \
> >      CPU_CFLAGS='-O2 -march=x86-64 -m64'
> >
> > --- config ---
> >
> > global
> >     maxconn 2000
> >     pidfile /usr/pkg/haproxy/run/haproxy.pid
> >     stats socket /usr/pkg/haproxy/run/stats
> >     log /usr/pkg/haproxy/jail/log daemon
> >     user daemon
> >     group daemon
> >
> > defaults
> >     timeout queue 3000
> >     timeout server 3000
> >     timeout client 3000
> >     timeout connect 3000
> >     option splice-auto
> >
> > frontend stats
> >     bind :8080
> >     mode http
> >     use_backend stats if TRUE
> >
> > backend stats
> >     mode http
> >     stats enable
> >     stats uri /stats
> >     stats refresh 5s
> >
> > frontend query
> >     log global
> >     option dontlog-normal
> >     option httplog
> >     bind :80
> >     mode http
> >     use_backend query if TRUE
> >
> > backend query
> >     mode http
> >     balance roundrobin
> >     option httpchk GET /r?q=LOOKSMARTKEYWORDLISTINGMONITOR&isp=DROPus
> >     option forwardfor
> >     option httpclose
> >     server foo1 foo1:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >     server foo2 foo2:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >     server foo3 foo3:8080 weight 150 maxconn 20 check inter 1000 rise 2 fall 1
> >
> > ...
> >




RE: Kernel tuning recommendations

2009-10-07 Thread Jonah Horowitz
I ended up just building a kernel without conntrack, module or otherwise.  I'm 
sure you could prevent conntrack from loading somehow, but this was easier from 
my perspective.
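
A quick way to confirm that a given build really has it out, assuming the
kernel config was saved under /boot (the path is illustrative):

  grep -i conntrack /boot/config-$(uname -r)
  # a conntrack-free build shows lines like:
  # # CONFIG_NF_CONNTRACK is not set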

Jonah


> -Original Message-
> From: Michael Marano [mailto:mmar...@futureus.com]
> Sent: Wednesday, October 07, 2009 3:03 PM
> To: ch...@sargy.co.uk
> Cc: haproxy@formilux.org; Mark Kramer
> Subject: Re: Kernel tuning recommendations
> 
> I've made a handful of changes based upon Chris and Willy's suggestions,
> which I've included below.  This avoids the nf_conntrack errors in the
> logs.
> 
> I would like to skip nf_conntrack altogether.  I've been digging around
> to try to learn how to do that, but I now admit I don't know how.  I
> can't just drop the module, as it's currently in use.
> 
> [mmar...@w1 w1]$ sudo modprobe -n -r nf_conntrack
> FATAL: Module nf_conntrack is in use.
> 
> What do I need to change in my iptables rules to pave the way for
> removing this module?  Once I've got that straight, how do I then
> disable the module?  I'm happy to get an RTFM response if I'm just
> being stupid. Point me at the right M ;)
> 
> Michael Marano
> 
> 
>  iptables rules script ---
> #!/bin/sh
> 
> sudo /sbin/iptables -F
> sudo /sbin/iptables -A INPUT -i lo -j ACCEPT
> sudo /sbin/iptables -A INPUT -i ! lo -d 127.0.0.0/8 -j REJECT
> sudo /sbin/iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> sudo /sbin/iptables -A OUTPUT -j ACCEPT
> 
> # don't track incoming or outgoing port 80
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp --dport 81 -j NOTRACK
> 
> # don't track traffic starting from the private ip
> sudo /sbin/iptables -t raw -A PREROUTING -p tcp -s 10.176.45.165 -j NOTRACK
> 
> # these may not actually be useful, but I'm leaving them in.
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 8080 -j NOTRACK
> sudo /sbin/iptables -t raw -A OUTPUT -p tcp --sport 81 -j NOTRACK
> 
> sudo /sbin/iptables -A INPUT -p tcp -m state --state NEW --dport 22 -j ACCEPT
> sudo /sbin/iptables -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
> sudo /sbin/iptables -A INPUT -j REJECT
> sudo /sbin/iptables -A FORWARD -j REJECT
>  iptables rules script ---
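
One note about the script above: the "-m state --state ESTABLISHED,RELATED"
rule is what keeps nf_conntrack loaded, since the state match depends on
connection tracking. A sketch of keeping the module out entirely, assuming
that rule is dropped or replaced first (the file name is illustrative):

  # /etc/modprobe.d/no-conntrack.conf
  install nf_conntrack /bin/false
  install nf_conntrack_ipv4 /bin/false

After that, a reboot (or unloading xt_state and nf_conntrack_ipv4 by hand)
leaves the box conntrack-free.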
> 
> 
> 
>  additions to sysctl.conf ---
> #
> # TCP tuning
> #
> # from
> http://agiletesting.blogspot.com/2009/03/haproxy-and-apache-performance-tuning.html
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.ip_local_port_range = 1024 65023
> net.ipv4.tcp_max_syn_backlog = 10240
> net.ipv4.tcp_max_tw_buckets = 40
> net.ipv4.tcp_max_orphans = 6
> net.ipv4.tcp_synack_retries = 3
> net.core.somaxconn = 4
> 
> # from
> http://serverfault.com/questions/11106/best-linux-network-tuning-tips
> net.ipv4.route.max_size = 262144
> net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 18000
> net.ipv4.neigh.default.gc_thresh1 = 1024
> net.ipv4.neigh.default.gc_thresh2 = 2048
> net.ipv4.neigh.default.gc_thresh3 = 4096
> net.netfilter.nf_conntrack_max = 128000
> net.netfilter.nf_conntrack_expect_max = 4096
> 
> # additions based on questions to the haproxy mailing list
> # http://www.mail-archive.com/haproxy@formilux.org/msg01321.html
> net.ipv4.tcp_timestamps = 1
> net.core.netdev_max_backlog = 4
> # these were all lower than the default values already set, so I left
> # them out
> #net.ipv4.tcp_rmem = 4096 8192 16384
> #net.ipv4.tcp_wmem = 4096 8192 16384
> #net.ipv4.tcp_mem = 65536 98304 131072
> 
>  additions to sysctl.conf ---
> 
> 
> 
> > From: 
> > Date: Wed, 07 Oct 2009 11:24:23 +0100
> > To: Michael Marano 
> > Cc: 
> > Subject: Re: Kernel tuning recommendations
> >
> > Here are the adjusted IPv4 settings I use on my haproxy box - I picked
> > these up from around the web, and they seem to work for me, not that
> > they are in use on a particularly high volume site currently.
> >
> > Chris
> >
> > net.ipv4.tcp_tw_reuse = 1
> > net.ipv4.ip_local_port_range = 1024 65023
> > net.ipv4.tcp_max_syn_backlog = 10240
> > net.ipv4.tcp_max_tw_buckets = 40
> > net.ipv4.tcp_max_orphans = 6
> > net.ipv4.tcp_synack_retries = 3
> > net.ipv4.tcp_max_syn_backlog = 45000
> > net.ipv4.tcp_timestamps = 1
> > net.ipv4.tcp_rmem = 4096 8192 16384
> > net.ipv4.tcp_wmem = 4096 8192 16384
> > net.ipv4.tcp_mem = 65536 98304 131072
> > net.core.somaxconn = 4
> > net.core.netdev_max_backlog = 4
> >
> >
> >
> > Quoting Michael Marano :
> >
> >> Subsequent load tests proved me wrong.  I'm still getting the
> >> nf_conntrack messages.  Perhaps I've misconfigured my iptables rules?
> >>
> >>
> >> # bits of /var/log/messages
> >>
> >> Oct  6 21:58:40 w1 kernel: [3718555.091684] printk: 2 messages
> suppressed.
> >> Oct  6 21:58:40 w1 kernel: [3718555.091705] nf_connt

RE: Nbproc question

2009-09-29 Thread Jonah Horowitz
Here's the output of top on the system:

top - 09:50:36 up 4 days, 18:50,  1 user,  load average: 1.31, 1.59, 1.55
Tasks: 117 total,   2 running, 115 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.5%us,  9.9%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.5%hi, 12.1%si,  0.0%st
Mem:   8179536k total,   997748k used,  7181788k free,   139236k buffers
Swap:  9976356k total,0k used,  9976356k free,   460396k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 752741 daemon    20   0 34760  24m  860 R  100  0.3 871:15.76 haproxy

It's a quad core system, but haproxy is taking 100% of one core.

We're doing less than 5k req/sec and the box has two 2.6ghz Opterons in it.

Do you know how much health checks affect cpu utilization of an haproxy process?

We have about 100 backend servers and we're running "inter 500 rise 2 fall 1"

I haven't tried adjusting that, although when it was set to the default our 
error rates were much higher.
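
Back-of-the-envelope, assuming "inter 500" means one check per server every
500 ms (which is what the option does):

  100 servers x (1 check / 0.5 s) = 200 health checks per second from this process

and with nbproc > 1 each process would run its own set of checks on top of that.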

Thanks,

Jonah


-Original Message-
From: Willy Tarreau [mailto:w...@1wt.eu] 
Sent: Monday, September 28, 2009 9:50 PM
To: Jonah Horowitz
Cc: haproxy@formilux.org
Subject: Re: Nbproc question

On Mon, Sep 28, 2009 at 06:43:58PM -0700, Jonah Horowitz wrote:
> In the documentation it seems to discourage using the nbproc directive.
> What's the situation with this?  I'm running a server with 8 cores, so I'm
> tempted to up the nbproc.  Is the process normally multithreaded?

No, the process is not multithreaded.

> Is nbproc
> something I can use for performance tuning, or is it just for file handles?

It can bring you small performance gains at the expense of a more
complex monitoring, since the stats will still only reflect the
process which receives the stats request. Also, health-checks will
be performed by each process, causing an increased load on your
servers. And the connection limitation will not work anymore, as
any process won't know that there are other processes already
using a server.

It was initially designed to workaround per-process file handle
limitations on some systems, but it is true that it brings a minor
performance advantage.

However, considering that you can reach 4 connections per second
with a single process on a cheap core2duo 2.66 GHz, and that forwarding
data at 10 Gbps on this machine consumes only 20% of a core, you can
certainly understand why I don't see the situations where it would
make sense to use nbproc.

Regards,
Willy




Nbproc question

2009-09-28 Thread Jonah Horowitz
In the documentation it seems to discourage using the nbproc directive.
What's the situation with this?  I'm running a server with 8 cores, so I'm
tempted to up the nbproc.  Is the process normally multithreaded?  Is nbproc
something I can use for performance tuning, or is it just for file handles?

-- 
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107




RE: artificial maxconn imposed

2009-09-18 Thread Jonah Horowitz
I fixed the nf_conntrack problem with this (really just the first one,
but the others were good too).

HAProxy sysctl changes

For network tuning, add the following to /etc/sysctl.conf:

net.ipv4.netfilter.ip_conntrack_max = 16777216
net.ipv4.tcp_max_tw_buckets = 16777216

Increase TCP max buffer size settable using setsockopt():

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

Increase Linux autotuning TCP buffer limits (min, default, and max number
of bytes to use); set max to at least 4MB, or higher if you use very high
BDP paths:

net.ipv4.tcp_rmem = 4096 87380 16777216  
net.ipv4.tcp_wmem = 4096 65536 16777216
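
These can be applied without a reboot with something like:

  sysctl -p /etc/sysctl.conf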

-jonah

-Original Message-
From: David Birdsong [mailto:david.birds...@gmail.com] 
Sent: Friday, September 18, 2009 3:06 PM
To: haproxy
Subject: artificial maxconn imposed

I've set ulimit -n 2

maxconn in defaults is 16384 and still somehow when I check the stats
page, maxconn is limited to 1, and sure enough requests start piling up.

any suggestions on where else to look?  i'm sure it's an OS thing, so:

Fedora 10 x86_64 16GB of RAM

this command doesn't turn anything up
find /proc/sys/net/ipv4 -type f -exec cat {} \; | grep 1


(also dmesg shows nf_conntrack: table full, dropping packet.) which i
think is another problem.  might be time to switch to a *BSD.
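
Two quick things to check here (a sketch, assuming a single haproxy process
is already running and a 2.6-era /proc):

  # the file-descriptor limit the running haproxy actually got
  grep 'open files' /proc/$(pidof haproxy)/limits

Also note that "maxconn" in the defaults section only caps each individual
proxy; the process-wide limit (which is what the stats page reports) comes
from the global section, e.g.:

  global
      maxconn 16384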




Backend Server UP/Down Debugging?

2009-08-26 Thread Jonah Horowitz
I’m watching my servers on the back end and occasionally they flap.  I’m 
wondering if there is a way to see why they are taken out of service.  I’d like 
to see the actual response, or at least an HTTP status code.

 

Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net

w: 415.348.7694 · c: 415.513.7202 · f: 415.348.7020

625 Second Street, San Francisco, CA 94107

 



Re: realtime switch to another backend if got 5xx error?

2009-07-30 Thread Jonah Horowitz
I'm trying to figure out how this works.  I desperately need to figure out a
way to monitor servers and either take any server that sends any 5xx error
out of rotation, or failing that, at least redirect the query to a different
server.

The clients that use this web service are SOAP/XML clients, so they're not
"real" web browsers.  Also, we don't use any cookies.

It looks like this config just tells the client to make a second request.
Am I missing something here?

I know I can use httpchk, but I don't want to run "inter 1" because then all
my traffic is monitoring traffic.  Each server is normally doing several
hundred requests per second, and our haproxy test setup is a couple orders
of magnitude higher on % of 500 errors. (10% vs .01%).

Any ideas?

Thanks,

Jonah



On 6/11/09 7:45 AM, "Maciej Bogucki"  wrote:

> Dawid Sieradzki / Gadu-Gadu S.A. writes:
>> Hi.
>> 
>> The problem is how to silently switch to another backend in realtime if
>> we get a 500 answer from a backend, without the http_client knowing.
>> Yes, I know about httpchk, but the error 500 happens about 10 times per
>> hour, and we don't know when or why.
>> So it is a race to see who gets the 500 first - httpchk or the http_client.
>> 
>> If You don't know what i mean:
>> 
>> example config:
>> 
>> 8<
>> 
>> frontend
>> (..)
>> default_backend back_1
>> 
>> backend back_1
>>option httpchk GET /index.php HTTP/1.1\r\nHost:\ test.pl
>>mode http
>>retries 10
>>balance roundrobin
>> 
>>  server chk1 127.0.0.1:81 weight 1 check
>>  server chk2 127.0.0.1:82 weight 1 check
>>  server chk3 127.0.0.1:83 weight 1 check backup
>> 
>> >8--
>> 
>> http_client -> haproxy -> (backend1|backend2|backend3)
>> 
>> let's go inside request:
>> 
>> A. haproxy received request from http_client
>> B. haproxy sent request from http_client to backend1
>> C. backend1 said 500 internal server error
>> 
>> I want: :-)
>> D. haproxy sent request from http_client to backend2 (or backup backend or
>> another one, or one more time to backend1)
>> 
>> I have: :-(
>> D. haproxy sent 500 internal server error to http_client from backend1
>> E. haproxy will mark backend1 as down if it got 2 error 500 responses from backend1
>> 
>> 
>> It is possible to do that?
>> 
> Hello,
> 
> Yes it is possible but it could be dangerous for some kinds of
> application, e.g. a billing system ;)
> Here is an example of how to do it. I know that it is a hack but it works
> well ;P
> 
> frontend fr1
> default_backend back_1
> rspirep ^HTTP/...\ [23]0..* \0\nSet-Cookie:\
> cookiexxx=0;path=/;domain=.yourdomain.com
> rspirep ^(HTTP/...)\ 5[0-9][0-9].* \1\ 202\ Again\
> Please\nSet-Cookie:\
> cookiexxx=1;path=/;domain=.yourdomain.com\nRefresh:\ 6\nContent-Length:\
> Lenght_xxx\nContent-Type:\ text/html\n\n src="http://www.yourdomain.com/redispatch.pl";>
> 
> backend back_1
>  cookie  cookiexxx
>  server chk1 127.0.0.1:81 weight 1 check
>  server chk2 127.0.0.1:82 weight 1 check
>  server chk3 127.0.0.1:83 weight 1 check cookie 1 backup
> 
> Remember to set Lenght_xxx properly.
> 
> Best Regards
> Maciej Bogucki
> 
> 

-- 
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107





Re: HAProxy - Inline Monitoring?

2009-05-26 Thread Jonah Horowitz
Willy,

I can see why, with some web farms, you wouldn't want to take servers out of
rotation after just one, or a few 5xx errors, particularly since often they
are caused by bad user input.  In our case, any 5xx errors are almost always
an indication that the server in question is in a bad state.  Particularly
problematic is that a server serving 5xx errors tends to do so much faster
than one responding to legitimate requests.  This means that a bad server
can serve several thousand requests before the next health check kicks it
out of service.

Implementing inline monitoring dropped our 5xx error rate by two orders of
magnitude, so it is pretty important for us.  If we move forward, we'll
likely submit a patch if the functionality doesn't exist as things stand
now.  Perhaps it would be better if it was a counter that took a server out
after a set number of consecutive failed requests.
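
For reference, haproxy 1.4 and later did grow server options along these
lines (observe / error-limit / on-error); a sketch of that syntax, not
something that existed at the time of this thread:

  server foo1 foo1:8080 check observe layer7 error-limit 10 on-error mark-down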

Jonah

On 5/24/09 9:56 PM, "Willy Tarreau"  wrote:

> Hi,
> 
> On Fri, May 22, 2009 at 11:37:14AM -0700, Jonah Horowitz wrote:
>> I'm currently testing HAProxy for deployment.  Right now we use NetScaler
>> load balancers, and they provide a feature called "inline monitoring".  With
>> inline monitoring the Netscaler will take a server out of rotation if it
>> responds with a 5xx error to a client request.  It does this separately from
>> standard health checks.  Is there a way to do this with HAProxy?
> 
> No, and I don't want to do the same as it seems a little bit risky to me.
> However what is planned is to switch to fast health-checks when a number
> of 5xx errors is encountered. That way, it would significantly reduce the
> time to detect a server failure without the risk of taking a server out of
> the farm on random errors.
> 
> Regards,
> Willy
> 

-- 
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107





HAProxy - Inline Monitoring?

2009-05-22 Thread Jonah Horowitz
I'm currently testing HAProxy for deployment.  Right now we use NetScaler
load balancers, and they provide a feature called "inline monitoring".  With
inline monitoring the Netscaler will take a server out of rotation if it
responds with a 5xx error to a client request.  It does this separately from
standard health checks.  Is there a way to do this with HAProxy?

-- 
Jonah Horowitz · Monitoring Manager · jhorow...@looksmart.net
W: 415-348-7694 · F: 415-348-7033 · M: 415-513-7202
LookSmart - Premium and Performance Advertising Solutions
625 Second Street, San Francisco, CA 94107



