Re: maxconn vs. option httpchk

2011-03-24 Thread Willy Tarreau
Hi Bryan,

On Wed, Mar 23, 2011 at 09:27:01PM +, Cassidy, Bryan wrote:
> Hi all,
> 
> I've noticed an odd (lack of) interaction between "maxconn" and "option 
> httpchk"... 
> 
> If a server's maxconn limit has been reached, it appears that HTTP health 
> checks are still dispatched. If I've configured the maxconn limit to match 
> the number of requests the backend server can concurrently dispatch, and all 
> these connections are busy with slow requests, HAProxy will assume the server 
> is down; once the server completes a request, HAProxy waits until "rise" 
> health checks have succeeded (as expected if the server was really down, but 
> it was only busy). This makes overly busy times even worse.

Yes, that's a known situation. Maxconn should always leave some room
for health checks. When you have two haproxies, you might have to leave
at least 2 connections free for the health checks. In practice, 1 should
be OK because health checks are supposed to be fast and it generally is
not an issue if one has to wait a little bit for a connection slot.

This issue is sometimes encountered on mongrel servers where only one
connection at a time is possible. The usual workaround for this case
is to set a check timeout larger than what you consider a long request
should be.
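
As a rough sketch of that workaround (the backend name, server name and
the 30s value below are made up; pick a timeout larger than your longest
expected request), the check timeout can be raised per backend:

backend mongrels
    option httpchk HEAD /healthchk
    # tolerate long busy periods before declaring the server down
    timeout check 30s
    server app1 app1.example.com:8080 check maxconn 1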

Even if that can sound frustrating at first, you have to realize that
if the server is failing to respond to health checks, there is no way
to know whether it's just too busy or really dead. So there's nothing
wrong with the current approach. If you pointed your browser at the
server, you'd observe the same behaviour. If you think you'd tell the
difference because you'd wait longer, then it means you should adjust
your check timeout accordingly.

(...)
> I know I can work around this by setting maxconn to one less than a server's 
> maximum capacity (perhaps this would be a good idea for other reasons).

Yes, that's the way to do it, and the spare slot will also let you
connect to the server directly without passing through haproxy.
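
As a sketch based on the toy configuration from the original message
(hypothetical names; assumes each server really handles 3 concurrent
requests), reserving one slot simply means setting maxconn to 2:

    # capacity is 3, so keep one slot free for checks / direct access
    default-server port 8080 inter 2s rise 2 fall 1 maxconn 2
    server srv1 srv1.example.com:8080 check
    server srv2 srv2.example.com:8080 check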

> I suspect I could work around this by using TCP status checks instead of HTTP 
> status checks, though I haven't tried this as I like the flexibility HTTP 
> health checks offer (like "disable-on-404").

You're right, but relying on TCP checks alone won't tell you when your
servers are really dead either: a frozen process can still accept TCP
connections while being unable to serve a single request.
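
For reference, a plain TCP connect check is what you get when "option
httpchk" is simply left out (sketch with the same hypothetical names):

listen load_balancer
    bind :80
    mode http
    balance leastconn
    # no "option httpchk": "check" falls back to a TCP connect check
    server srv1 srv1.example.com:8080 check
    server srv2 srv2.example.com:8080 check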

> Is this behavior a bug or a feature? Intuitively I would have expected the 
> HTTP health checks to respect maxconn limits, but perhaps there was a 
> conscious decision to not do so (for instance, maybe it was considered 
> unacceptable for a server's health to be unknown when it is fully loaded).

We have a task on the TODO list to make health checks pass through the queue
and respect the maxconn too. This is especially important for mongrel. But
still, doing so does not cover the situation where you have multiple LBs or
when you need to check the server by yourself.

Regards,
Willy




maxconn vs. option httpchk

2011-03-23 Thread Cassidy, Bryan
Hi all,

I've noticed an odd (lack of) interaction between "maxconn" and "option 
httpchk"... 

If a server's maxconn limit has been reached, it appears that HTTP health 
checks are still dispatched. If I've configured the maxconn limit to match the 
number of requests the backend server can concurrently dispatch, and all these 
connections are busy with slow requests, HAProxy will assume the server is 
down; once the server completes a request, HAProxy waits until "rise" health 
checks have succeeded (as expected if the server was really down, but it was 
only busy). This makes overly busy times even worse.

I'm not sure if this explanation is clear; perhaps a concrete configuration 
might help.

listen load_balancer
    bind :80
    mode http

    balance leastconn
    option httpchk HEAD /healthchk
    http-check disable-on-404

    default-server port 8080 inter 2s rise 2 fall 1 maxconn 3
    server srv1 srv1.example.com:8080 check
    server srv2 srv2.example.com:8080 check
With the above toy example, if each of srv1 and srv2 can only respond to 3 
requests concurrently, and 6 slow requests come in (each taking more than 2 
seconds), both backend servers will be considered down for up to 4 seconds in 
the worst case (inter 2s * rise 2) after one of the requests finishes.

I know I can work around this by setting maxconn to one less than a server's 
maximum capacity (perhaps this would be a good idea for other reasons). I 
suspect I could work around this by using TCP status checks instead of HTTP 
status checks, though I haven't tried this as I like the flexibility HTTP 
health checks offer (like "disable-on-404").

Is this behavior a bug or a feature? Intuitively I would have expected the HTTP 
health checks to respect maxconn limits, but perhaps there was a conscious 
decision to not do so (for instance, maybe it was considered unacceptable for a 
server's health to be unknown when it is fully loaded).

Thanks,
Bryan