Hi,

we're hitting a big roadblock in getting HAProxy to load balance our
apps with transparent reloads. Namely, when the new instance starts up
with the -sf option, the old instances don't seem to react to SIGUSR1
properly. They hang around indefinitely and snatch incoming requests
from the new instance, which can lead to them routing requests to
backends that have since been taken out of commission.
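
For reference, the reload we run is essentially the standard
soft-reload invocation (the paths here are placeholders, not verbatim
from our setup):

$ haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
    -sf $(cat /var/run/haproxy.pid)

i.e. the new process binds its sockets and then signals the PIDs
listed after -sf to finish up and exit.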

We tried inspecting it, and the first lead we got was "option
forceclose", but that didn't help. After playing with the debugging
info available over the stats socket interface (queried roughly as
sketched right after this list), we've come to the following
conclusions:

- It reports no connections that would prevent it from shutting down
- Consistent with the above, it does NOT need to have any working
backends to distribute load to. However, making HTTP requests to
HAProxy's port still seems to increase the chances of it getting into
the bugged state
- Reported values for "tasks" are in the range of 9-15, and
"run_queue" is 1-2. I don't know the exact meaning of these values, so
I don't know how to judge their importance, but processes that exit
properly (see below) don't seem to have any lower values.
- SIGTTOU doesn't appear to have an effect -- the process doesn't
unbind its socket and continues to receive connections. It *does*
print "global.srvcls", "global.clicls" and "global.close" to stdout
when run in debugging mode though, after which it happily prints
handling details for newly accepted connections.
- SIGUSR1 does not shut it down when it's in the bugged state. I can't
tell whether it even *receives* USR1, though, as there's no debug
output for a received signal, only for a signal-caused shutdown.
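
For reference, this is roughly how we've been poking at a stuck
process from inside the container (this assumes a "stats socket
/var/run/haproxy.sock" line in the global section and socat being
available in the image; <old_pid> is a placeholder for the PID taken
from ps):

# echo "show info" | socat stdio /var/run/haproxy.sock
# echo "show sess" | socat stdio /var/run/haproxy.sock
# kill -TTOU <old_pid>
# kill -USR1 <old_pid>

"show info" is where the tasks/run_queue values above come from, "show
sess" is what reports no lingering connections, and the two kill
commands correspond to the signals described above.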

Weirdly enough, it doesn't happen every time -- every once in a while
everything is A-OK and the instances shut down cleanly, which served
to confuse debugging greatly, as it made it appear that changes we
made to the config had an effect. Nothing turned out to be consistent,
though, and eventually we'd be back to staring at a long process list
of haproxy instances stealing requests from each other.

We've created a test setup that reproduces the issue; it's very
similar to the setup we're trying to deploy (modulo the backend
"services" being complete dummies that don't even serve anything on
port 80). It's based on Docker, so it requires a new enough version of
Docker to run (we're running 1.9 here). To run it, unpack it, then run
./haproxy-bug-repro.sh. The setup registers backends (i.e. newly
spawned docker containers exposing port 80 and named appropriately)
via registrator into Consul, which facilitates service discovery.
Consul's catalogue is then templated into HAProxy's config and the
haproxy process is reloaded (a rough sketch of that wiring follows the
repro instructions below). To see the issue, do:

$ docker exec -ti haproxy sh
# ps ax

(note: if you want to start over, you will have to remove the docker
containers first, because their names have to be unique. Running

$ docker rm -f backend-{1,2,3,4,5,6,7,8,9,10} haproxy

should help)
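
For context, the reload wiring inside the haproxy container is along
these lines -- the flags and paths below are illustrative rather than
copied verbatim from the tarball:

$ consul-template -consul consul:8500 \
    -template "/etc/haproxy/haproxy.cfg.ctmpl:/etc/haproxy/haproxy.cfg:/reload-haproxy.sh"

where /reload-haproxy.sh runs the same "haproxy ... -sf" command shown
near the top of this mail.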

Link to the tarball:
https://purestorage.app.box.com/s/nnzqueais46plzd9xfisnmkeab7j9s0y

I will be sending it as an attachment in a separate mail as a followup
to this one, in case the mailing list software scrubs attachments
and/or considers them spam.

Thanks,
Maciej
