I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
problem where haproxy marks some backend servers as being DOWN with a
message "L4TOUT in 2000ms". Some times the message also has a star: "*
L4TOUT in 2000ms" (I didn't find what the star means from the docs). Also
the reported timeout varies between 2000ms and 2003ms.

This does not happen to every backend and it doesn't happen immediately.
After restart every backend is green and a few backends starts to get
marked DOWN after about 30 minutes or so. I'm also running two instances in
two different servers and they both suffer the same problem but the DOWN
servers aren't same. So server A might be marked DOWN on haproxy-1 and
server B marked down on haproxy-2 (or vice versa).

This seems to happen regardless how much traffic I run into the haproxies.
I can always ssh into the haproxies and run curl against the check url and
it always works, so this problem seems to be inside haproxy.

My haproxy config is a kind of long so I copied it here:
http://koti.kapsi.fi/garo/nobackup/haproxy.cfg (I've sanitised it a bit,
but only hostnames).

I've ran the logging with verbose debugging to check if that gives any
clues on the health check issue, but the logs did not reveal anything to my
eye. I can however gather a new log sample on the health checks, but the
haproxies are now receiving production traffic so the log amount would be
too much to gather at the current moment.

I've also gathered some tcpdump traffic to the hosts marked DOWN and
strangely it seems that the hosts is receiving queries. It could be that
one (or more) processes (I'm using nbprocs 7 on my 8 core aws c3.2xlarge
instance) haven't marked the host down. Trying to refresh the stats uri
doesn't seem to indicate this, but it's hard to be sure as the probability
of going thru all seven different processes fast enough is low.

All clues and debugging ideas are greatly appreciated.

Reply via email to