Re: HAProxy 1.5 incorrectly marks servers as DOWN

Pavlos Parissis Tue, 09 Sep 2014 15:09:06 -0700

On 08/09/2014 10:30 πμ, Juho Mäkinen wrote:
> 
> On Thu, Sep 4, 2014 at 11:35 PM, Pavlos Parissis
> <pavlos.paris...@gmail.com <mailto:pavlos.paris...@gmail.com>> wrote:
> 
>     On 04/09/2014 08:55 πμ, Juho Mäkinen wrote:
>     > I'm upgrading my old 1.4.18 haproxies to 1.5.4 and I have a mysterious
>     > problem where haproxy marks some backend servers as being DOWN with a
>     > message "L4TOUT in 2000ms". 
>     Are you sure that you haven't reached any sort of limits on your backend
>     servers? Number of open files and etc...
> 
> 
> Quite sure because I can always use curl from the haproxy machine to the
> backend machine and I get the response to the check command always
> without any delays. 
> 
>     Are you sure that backend servers return a response with HTTP status 200
>     on healtchecks?
> 
> 
> Yes. I also ran strace on a single haproxy process when the haproxy
> marked multiple backends as being down. Here's an example output:
> 
> 08:06:07.302582 connect(30, {sa_family=AF_INET, sin_port=htons(3500),
> sin_addr=inet_addr("172.16.6.102")}, 16) = -1 EINPROGRESS (Operation now
> in progress)
> 08:06:07.303024 recvfrom(30, 0x1305494, 16384, 0, 0, 0) = -1 EAGAIN
> (Resource temporarily unavailable)
> 08:06:07.303097 getsockopt(30, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> 08:06:07.303167 sendto(30, "GET /check HTTP/1.0\r\n\r\n", 23,
> MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 23
> 08:06:07.304522 recvfrom(30, "HTTP/1.1 200 OK\r\nX-Powered-By:
> Express\r\nAccess-Control-Allow-Origin:
> *\r\nAccess-Control-Allow-Methods: GET, HEAD, POST, PUT, DELE"...,
> 16384, 0, NULL, NULL) = 503
> 08:06:07.304603 setsockopt(30, SOL_SOCKET, SO_LINGER, {onoff=1,
> linger=0}, 8) = 0
> 08:06:07.304666 close(30)               = 0
> 
> So the server clearly sends an HTTP 200 OK response, in just 1.9 ms. I
> analysed around 20 different checks via the strace to the same backend
> (which is marked down by haproxy) and none of them was over one second.
>


Are sure that for the above respone was for a healtcheck that marked the
server down? It is quite difficult to find this, I have been in your
position and took me some time to find the actual problem.


> Here's an example from haproxy logging what happens when the problem starts:
> 
> Sep  8 07:22:25 localhost haproxy[24282]: [08/Sep/2014:07:22:24.615]
> https comet-getcampaigns/comet-172.16.2.97:3500 423/0/1/3/427 200 502 -
> - ---- 1577/1577/3/1/0 0/0 "GET /mobile HTTP/1.1"
> Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:24.280]
> https~ comet-getcampaigns/comet-172.16.2.97:3500 771/0/2/346/1121 200
> 40370 - - ---- 2769/2769/6/0/0 0/0 "GET /mobile HTTP/1.1"Sep  8 07:22:25
> localhost haproxy[24284]: [08/Sep/2014:07:22:25.090] https~
> comet-getcampaigns/comet-172.16.2.97:3500 379/0/2/-1/804 502 204 - -
> SH-- 2733/2733/7/0/0 0/0 "GET /mobile HTTP/1.1"
> Sep  8 07:22:25 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 231ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24281]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 217ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 137ms, status: 2/3 UP.
> Sep  8 07:22:25 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Socket error,
> info: "Connection reset by peer", check duration: 393ms, status: 2/3
> UP.Sep  8 07:22:25 localhost haproxy[24284]: [08/Sep/2014:07:22:25.661]
> https comet-getcampaigns/comet-172.16.2.97:3500 305/0/1/-1/314 -1 0 - -
> SD-- 2718/2718/5/0/0 0/0 "GET /mobile HTTP/1.1"

The above means that the processes received a TCP RST packet on the
opening socket towards the backend. Have you run tcpdump on haproxy to
see if you backends send TCP RST.
Do you have any kind of firewall(network or host based) between haproxy
and backend?

> Sep  8 07:22:27 localhost haproxy[24278]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 2/3 UP.
> Sep  8 07:22:27 localhost haproxy[24279]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 2/3 UP.
> Sep  8 07:22:28 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 2ms,
> status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 1ms,
> status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 1ms,
> status: 1/3 UP.
> Sep  8 07:22:28 localhost haproxy[24283]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 2ms,
> status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24278]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 1ms,
> status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24279]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 1/3 UP.
> Sep  8 07:22:29 localhost haproxy[24281]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24281]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0
> backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:29 localhost haproxy[24280]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24280]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0
> backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:29 localhost haproxy[24284]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 0/2 DOWN.
> Sep  8 07:22:29 localhost haproxy[24284]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 14 active and 0
> backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> Sep  8 07:22:30 localhost haproxy[24282]: Health check for server
> comet-getcampaigns/comet-172.16.2.97:3500 failed, reason: Layer4
> connection problem, info: "Connection refused", check duration: 0ms,
> status: 0/2 DOWN.
> Sep  8 07:22:30 localhost haproxy[24282]: Server
> comet-getcampaigns/comet-172.16.2.97:3500 is DOWN. 15 active and 0
> backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> 
> So first there's a Socket error, info: "Connection reset by peer",
> followed with Layer4 connection problem, info: "Connection refused"
> until the server is marked DOWN.
> 
> 
>     You have only 1 stats server while you have 7 processes. You need to
>     enable for each process a stats socket, here is an example from a 24
>     processes
> 
> 
> All individual haproxy processes marks the same backend servers being
> down once the problem starts. Another machine has also similar problem,
> but the backend severs which are marked down are not the same. Note that
> I have three backends which all share the same servers (comet,
> comet-priority and comet-getcampaigns). The different backends might
> have different servers marked as DOWN. Example
> screenshot: 
> http://koti.kapsi.fi/garo/nobackup/haproxy-different-servers-down.png
> (the brown servers are marked down manually via the admin socket)
> 
> As a side question: The documentation was a bit unclear. If I have
> nbproc > 1 and I use the admin socket to turn servers administrative
> status down or up, do I need to do it to separated admin sockets per
> haproxy process, or can I just use one admin socket?
> 

You need a different socket. Each process can only be managed by a
dedicated stats socket. There isn't any kind of aggregation where you
issue a command to 1 stats socket and this command is pushed to all
processes. Next release will address this kind of issues.

signature.asc
Description: OpenPGP digital signature

Re: HAProxy 1.5 incorrectly marks servers as DOWN

Reply via email to