Thank you for your help. :)

2010/1/6 Krzysztof Olędzki <o...@ans.pl>:
> On 2010-01-06 18:45, Paul Hirose wrote:
>> The docs under "retries" says if a connection attempt fails, it waits
>> one second, and then tries again.
>
> This 1s timeout is only used in case of an immediate error (like TCP RST),
> not in case of timeouts.
>
>> I was wondering how (if at all)
>> that works in conjunction with "timeout connect", which is how long
>> haproxy waits to try to connect to a backend server.  Is the one
>> second delay between retries *after* the "timeout connect" number of
>> seconds (after all, until "timeout connect" number of seconds has
>> passed, the connection attempt hasn't failed)?
>
> - above two timeouts are independent,
> - there is no 1s turnaround after a timeout.


So to summarize the timeout issues when connecting to the backend
server: a client request comes into haproxy, which then forwards it to
one of the backend servers.  If the connection fails immediately,
haproxy waits 1s and then retries the same backend server.  It repeats
this up to "retries" times, or "retries - 1" times if "option
redispatch" is set (the last retry going to some other backend
server.)

For a non-immediate error (the connection attempt just hangs without
ever completing), haproxy will wait up to "timeout connect" amount of
time.  If after that much time a connection still isn't established,
will haproxy immediately try to connect again, rather than waiting 1s
before retrying the same backend server?
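For context, here is a minimal backend sketch tying these knobs together (the backend name, addresses, and timeout values are placeholders, not taken from your config):

```
backend LDAPFarm
    mode tcp
    balance roundrobin
    # give up on an unanswered connection attempt after 4s;
    # the 1s turnaround only applies to immediate errors (e.g. TCP RST)
    timeout connect 4s
    retries 3
    # send the last retry to another server in the farm
    option redispatch
    server LDAP1 192.0.2.1:389 check
    server LDAP2 192.0.2.2:389 check
```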

>> This is what I see when I do the -sf option.
>
> <CUT>
>
>> Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
>> reason: Socket error, check duration: 46ms. 1 active and 0 backup
>> servers online. 0 sessions requeued, 0 total in queue.
>
>> Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
>> reason: Socket error, check duration: 41ms. 0 active and 0 backup
>> servers online. 0 sessions requeued, 0 total in queue.
>
> OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which is
> still quite ambiguous. There are three calls to set_server_check_status()
> with "NULL" as the additional info:
>
>> $ egrep -R "set_s.*HCHK_STATUS_SOCKERR.*NULL" src
>> src/checks.c:                   set_server_check_status(s,
>> HCHK_STATUS_SOCKERR, NULL);
>> src/checks.c:
>> set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>> src/checks.c:
>> set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
>
> Could you please try to change the second "NULL" to "strerror(errno)"?

I've made that patch.  I am using 1.4-dev5.tar.gz, but not the snapshot
from 1.4-ss-20100106.tar.gz.  With your three strerror(errno) changes
in place, I am now seeing a bit more info in my /var/log/messages:

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, info: "Resource temporarily unavailable", check
duration: 51ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

Jan  6 12:07:30 lbtest1 haproxy[7319]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, info: "Resource temporarily unavailable", check
duration: 41ms. 0 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

I also occasionally get a slightly different error message:
Jan  6 12:21:49 lbtest1 haproxy[15709]: Server LDAPSFarm/LDAPS1 is
DOWN, reason: Socket error, info: "Operation now in progress", check
duration: 277ms. 1 active and 0 backup servers online. 0 sessions
requeued, 0 total in queue.

I don't notice a pattern to which backend server health-check gets
either Operation now in progress or Resource temporarily unavailable
error.  It seems random.

> I noticed you are using "addr" to use localhost as the source address of
> your health-checks:
>
>>        server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s
>> fastinter 1s downinter 5s fall 2 rise 2
>>        server LDAP2 CCCC:389 check addr localhost port 9102 inter 5s
>> fastinter 1s downinter 5s fall 2 rise 2
>
> I think that could be the source of your problems.
>
> I'll try to reproduce similar condition in my environment, but before I
> could be able to do this - would you please try to drop "addr localhost" for
> now and check if it makes any difference?

I do need the "check addr localhost port 9101", for example.  My
health-check scripts actually run on the same computer as haproxy
(and not on the backend servers.)  I don't have access to the actual
backend server(s) and thus cannot put a health-check script on them.
I changed localhost to 127.0.0.1 just on the off chance there might be
something there.

My xinetd.conf has "instances 50" and "per_source 10", so I figure
xinetd should be able to run multiple copies of my health-check
scripts at once, if it came to that.  I do have "spread-checks 20" in
my haproxy.cfg file, just to spread the checks out.  But I figure a
reload/restart of haproxy won't spread the checks around.
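For reference, the xinetd side of a setup like this usually looks something like the stanza below (the service name and script path are hypothetical; only the instances/per_source values come from my config):

```
# hypothetical xinetd service wrapping the health-check script
service ldap1-check
{
        type            = UNLISTED
        port            = 9101
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = nobody
        server          = /usr/local/bin/check-ldap1.sh
        instances       = 50
        per_source      = 10
}
```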

Thank you,
PH
