On 2010-01-06 21:31, Paul Hirose wrote:
2010/1/6 Krzysztof Olędzki <o...@ans.pl>:
On 2010-01-06 18:45, Paul Hirose wrote:
The docs under "retries" say that if a connection attempt fails, it waits
one second and then tries again.
This 1s delay is only used in case of an immediate error (like a TCP RST),
not in case of timeouts.

I was wondering how (if at all)
that works in conjunction with "timeout connect", which is how long
haproxy waits to try to connect to a backend server.  Is the one
second delay between retries *after* the "timeout connect" number of
seconds (after all, until "timeout connect" number of seconds has
passed, the connection attempt hasn't failed)?
- the above two timeouts are independent,
- there is no 1s turnaround after a timeout.


So to summarize the timeout issues connecting to the backend server: a
client request comes into haproxy, which then sends it to one of the
backend servers.  If the connection fails immediately, haproxy waits 1s
and then tries again to the same backend server.  It repeats this up to
"retries" times, or up to "retries - 1" times if "option redispatch" is
set (the last retry being sent to some other backend server).

For a non-immediate error (as in just trying to connect and hanging
there) but still not actually making a connection, haproxy will wait
up to "timeout connect" amount of time.  If after that much time a
connection still isn't established, haproxy will immediately try to
connect again, rather than waiting 1s and then trying to connect again
to the same backend server?

Not yet. Such an enhancement has recently been suggested, even with a patch, but it hasn't been implemented yet, as I would like to skip the 1s turnaround only if there is a high chance of selecting a different server. However, it is near the top of my "short things" TODO list.
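
For reference, the directives discussed above would sit together roughly
like this in haproxy.cfg (just a sketch with made-up values):

    defaults
        retries          3
        option redispatch
        timeout connect  5s

With such a setup, an immediate connection error is retried after the 1s
turnaround, a hanging connection attempt is aborted after "timeout
connect" (5s here), and with "option redispatch" the last retry may be
sent to another server.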

This is what I see when I do the -sf option.
<CUT>

Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, check duration: 46ms. 1 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.
Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, check duration: 41ms. 0 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.
OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which is
still rather ambiguous. There are three calls to set_server_check_status()
with "NULL" as the additional info:

$ egrep -R "set_s.*HCHK_STATUS_SOCKERR.*NULL" src
src/checks.c:                   set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
src/checks.c:                   set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
src/checks.c:                   set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
Could you please try to change the second "NULL" to "strerror(errno)"?
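
In other words, the change is a one-liner, roughly (a sketch only, the
surrounding code in src/checks.c is omitted):

    /* before: no extra info attached to the check status */
    set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);

    /* after: pass the textual errno so it shows up as "info:" in the log */
    set_server_check_status(s, HCHK_STATUS_SOCKERR, strerror(errno));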

I've made that patch.  I am using 1.4-dev5.tar.gz, not the snapshot
from 1.4-ss-20100106.tar.gz.  With your three strerror(errno) patches
in, I am now seeing a bit more info in my /var/log/messages:

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, info: "Resource temporarily unavailable", check
duration: 51ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, info: "Resource temporarily unavailable", check
duration: 41ms. 0 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

Like I thought - EAGAIN. It doesn't tell us too much. :(

I also get a slightly different error message on occasion:
Jan  6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is
DOWN, reason: Socket error, info: "Operation now in progress", check
duration: 277ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

EINPROGRESS, the same. :(
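
For context, this is roughly how a non-blocking connect reports those
errors in general (a generic sketch, not the actual checks.c code): the
initial connect() comes back with EINPROGRESS/EAGAIN right away, and the
real outcome only becomes known later through getsockopt(SO_ERROR), which
is why these two errno values by themselves tell us so little.

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>

    /* Generic illustration only: start a non-blocking connect and read
     * back the real result once the socket becomes writable. */
    static int start_connect(int fd, const struct sockaddr *sa, socklen_t len)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        if (connect(fd, sa, len) == -1 &&
            errno != EINPROGRESS && errno != EAGAIN)
            return -1;      /* hard failure, e.g. ENETUNREACH */
        return 0;           /* pending: poll for writability first */
    }

    static int finish_connect(int fd)
    {
        int err = 0;
        socklen_t elen = sizeof(err);

        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) == -1)
            return -1;
        if (err) {
            errno = err;    /* the real result, e.g. ECONNREFUSED */
            return -1;
        }
        return 0;           /* connection established */
    }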

I don't notice a pattern to which backend server's health check gets
the "Operation now in progress" or the "Resource temporarily unavailable"
error.  It seems random.

For now, the only suggestion I have for you is to try running haproxy under strace and check which syscalls fail shortly before a "Socket error" message is written. But I'm afraid we would end up needing to add something like: http://haproxy.1wt.eu/git?p=haproxy.git;a=commitdiff;h=6492db5453a3d398f096e9f7d6e84ea3984a1f04
in more places.
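
Something along these lines should be enough to catch the failing syscalls
(the pid and the output file name are just examples):

    strace -f -tt -e trace=network -o /tmp/haproxy.strace -p <haproxy pid>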

I noticed you are using "addr" to set localhost as the source address of
your health checks:

       server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
       server LDAP2 CCCC:389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2
I think that could be the source of your problems.

I'll try to reproduce a similar condition in my environment, but before I
am able to do this - would you please try to drop "addr localhost" for
now and check if it makes any difference?

I need to do the "check addr localhost port 9101", for example.  My
health-check scripts actually run on the same computer as haproxy
(and not on the backend servers).  I don't have access to the actual
backend server(s) and thus cannot put a health-check script on them.
I changed localhost to 127.0.0.1 just on the off chance there might be
something there.

My xinetd.conf has "instances 50", "per_source 10", so I figure xinetd
should be able to run multiple copies of my health-check scripts at
one time, if it came to that.  I do have "spread-checks 20" in my
haproxy.cfg file, just to try and spread it around.  But I figure a
reload/start of haproxy won't spread the checks around.

Right. I was responding to your mail shortly before leaving work, I was in too much of a hurry and mistook "addr" for "source". Please ignore this idea for now.

Best regards,

                                Krzysztof Olędzki
