Hi Alexey,

On Thu, Jun 09, 2011 at 01:32:06PM +0400, Alexey Vlasov wrote:
> Hi!
> 
> Here, actually, I've found a description of the same problem: when
> Apache goes down or is restarted, haproxy returns a 502 error to users.
> http://www.formilux.org/archives/haproxy/0812/1575.html
> 
> Here is an example of how it looks:
> 
> # while true; do echo -n `date "+%T.%N "`" "; curl -s \
>     http://test-nl11-apache-aux2.com/uptime.php; echo; done
> 12:50:21.294819803  OK
> 12:50:21.481879293  OK
> 12:50:21.666777343  OK
> ...
> I stop Apache:
> # /opt/apache_aux2_pool1/current/sbin/apachectl -k stop
> I receive an error:
> ...
> 12:50:21.854037923  OK
> 12:50:22.039332296  OK
> 12:50:22.244071674  <html><body><h1>502 Bad Gateway</h1>
> The server returned an invalid or incomplete response.
> </body></html>
> 
> 12:50:22.463404198  OK
> 12:50:22.653188547  OK
> ...
> ...
> 
> Haproxy log attached.
> 
> My haproxy.conf:
> ==========
> global
>     daemon
>     user        haproxy
>     group       haproxy
>     chroot      /var/empty
>     ulimit-n    32000
> 
> defaults
>     log         127.0.0.1 local1 notice
>     mode        http
>     maxconn     2000
>     balance     roundrobin
>     option      forwardfor except 111.222.111.222/32
>     option      redispatch
>     retries     10
>     stats       enable
>     stats uri   /haproxy?stats
>     timeout connect    5000
>     timeout client     150000
>     timeout server     150000
> 
> listen  backend_pool1   111.222.111.222:9099
>     option      httplog
>     log         127.0.0.1 local2
>     cookie      SERVERID insert indirect
>     option      httpchk
>     capture     request header Host len 40
>     server      pool1 111.222.111.222:8099 weight 256 cookie backend1_pool1 check inter 500 fastinter 100 fall 1 rise 2 maxconn 500
>     server      pool2 111.222.111.222:8100 weight   1 cookie backend1_pool2 check inter 800 fastinter 100 fall 1 rise 2 maxconn 250
>     server      pool3 111.222.111.222:8101 backup
> ==========
> 
> What I want is for haproxy not to return the 502 error to the user
> immediately, but to retry the request N times at some interval, and
> only return the 502 error if all retries fail. Can I do this somehow,
> or is there another suitable solution?

It's more complex than just black or white. There is a solution that
avoids any error at all, but let me first explain what is happening and
why it behaves that way.

When you restart Apache that way, you break existing connections at any
point during their processing. Some were waiting for a request to come,
some were processing the request, some were sending response headers, and
some were sending response data. The 502 that you're seeing indicates that
Apache had accepted the connection but did not finish sending headers, so
most likely it was processing the request. Processes killed before
accepting the connection will at most cause a connection retry to occur;
if a process is killed after Apache has started sending a response, you
won't see a 502 at all, the client will simply get a truncated response.
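
By the way, this connection-level retry is what the "retries" and
"option redispatch" lines already present in your defaults section
cover; roughly speaking, they only act while the connection to the
server is being established, not once a request has been sent:

    retries     10
    option      redispatch

That's why they don't help for the requests that were already in flight
when Apache was stopped.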

There are two issues with retrying requests. The first one is related to
the implementation (here haproxy, but any component will have a limit,
even though a different one). The issue is that haproxy has a request
buffer of limited size. A full request is buffered, parsed, processed
and forwarded to the server. From this point, the request is no longer
in haproxy's buffer. In theory, by adding a few more pointers, as long
as the data in the buffer have not been replaced, we could find the
request there and retry it, and indeed we'll have to do that in the
future, but more on this below. The problem is that some requests will
definitely not fit in the buffer at all. Let's say you get a PUT request
with a 10 MB file. The server breaks the connection after you have
forwarded 9 MB. You'd have to forward those 9 MB again, but it's clearly
impossible to keep such large buffers around just for hypothetical
retries. So there will always be a class of requests that cannot be
replayed because of implementation limitations, whatever limits you set.
And the same is true for the response: for all requests that were
aborted after the server started to respond, we can't tell the client
"hey, please ignore what I sent you till now, here's a new version
instead".

The second issue comes from the HTTP specification. HTTP says that only
idempotent requests may be replayed, which means requests whose effect
on the server is exactly the same whether you perform them once or any
number of times. A GET should be an idempotent request (in theory): if
you retrieve a static file, fail in the middle and do it again, the
server's state will not change. A GET with a query string already calls
for caution. And a POST is definitely not an idempotent request. When
you order a book on a site, you don't want a stupid load balancer
between you and the site to silently post your order a second time
because the first connection died in the middle. The same goes when you
click "delete this mail" in your preferred webmail: you don't want the
LB to send that request twice.

So if a non-idempotent request fails, it will never be replayed at all,
regardless of buffer capacity. That rule is mandated by the HTTP
specification, and respecting it is very important.

This means that if you blindly restart an HTTP server which holds
connections, then whatever proxy, LB or other component sits in front
of it, the restart will always be visible to some users.

So what can we do? Haproxy includes several ways to seamlessly act on
your servers. There are two common methods people use, depending on how
the work is split between their production teams:

  - those doing everything themselves disable the server on haproxy,
    wait for all active requests to complete, and restart the server;

  - those where the LB is not managed by the same people as the servers
    prefer to make the server tell haproxy that it is going down.

The first method involves the maintenance mode, which can be triggered
either from the web stats page or from the CLI ("disable server",
"enable server").

The second method is different: it consists in making the server change
its response to the health checks so that haproxy stops sending requests
there; then the server can be restarted. There are two variants, hard
(basically return 500 instead of 200 to the health check) and soft
(respond 404 first). The former kicks every user off the server and is
appropriate for static servers or stateless servers in general. The
latter still accepts requests from users with a persistence cookie, but
will not assign new users to the server. It's for stateful servers (e.g.
the one where you're ordering your book and which holds your session).
In both cases, either you wait for the logs to report that there's no
activity anymore, or you decide that a few minutes after the
announcement you automatically perform the restart.
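
As a sketch of the soft variant (every name and path below is just an
example): point the health check at a small script on the Apache side
and add "http-check disable-on-404" so that a 404 on the check stops
new users from being sent there while cookie-persistent users still get
through. Assuming CGI is available on that Apache, something like this:

    # in the "listen backend_pool1" section
    option      httpchk GET /health.cgi
    http-check  disable-on-404

    #!/bin/sh
    # health.cgi (hypothetical): answer 404 while a drain flag exists
    if [ -e /var/run/apache_pool1.drain ]; then
        printf 'Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\ndraining\r\n'
    else
        printf 'Status: 200 OK\r\nContent-Type: text/plain\r\n\r\nOK\r\n'
    fi

Taking the server out is then just touching the flag file, and putting
it back is removing it. Returning a 500 instead of the 404 would give
you the hard variant.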

There are people who script automatic config updates, haproxy upgrades,
or system reboots with this second method. It's very convenient and
flexible, and you never break any connection that way.
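
A minimal wrapper built on the flag-file idea above could look like
this (the drain delay and paths are arbitrary placeholders; watch your
logs instead if you prefer):

    #!/bin/sh
    # sketch: soft-stop Apache behind haproxy, restart it, re-enable it
    FLAG=/var/run/apache_pool1.drain

    touch "$FLAG"        # health check now answers 404
    sleep 300            # arbitrary drain delay before restarting

    /opt/apache_aux2_pool1/current/sbin/apachectl -k restart

    rm -f "$FLAG"        # check answers 200 again, haproxy re-enables the server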

Hoping this helps,
Willy

