On 2010-01-06 18:45, Paul Hirose wrote:
Busy little haproxy beaver today :)
Hehe, Hello Paul ;)

The docs under "retries" says if a connection attempt fails, it waits
one second, and then tries again.

This 1s turnaround is only used in case of an immediate error (like a TCP RST), not in case of timeouts.

I was wondering how (if at all)
that works in conjunction with "timeout connect", which is how long
haproxy waits to try to connect to a backend server.  Is the one
second delay between retries *after* the "timeout connect" number of
seconds (after all, until "timeout connect" number of seconds has
passed, the connection attempt hasn't failed)?

- the two timeouts above are independent,
- there is no 1s turnaround after a timeout (see the example below).
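
For instance, with something like this (the values are purely illustrative):

        defaults
            retries         3     # retry a failed connection attempt up to 3 times
            timeout connect 5s    # how long each single attempt may wait to be accepted

each attempt may take up to the full 5s before it is considered failed, and the 1s turnaround is only inserted after an immediate error (like a TCP RST), never after a connect timeout.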

I stumbled across "timeout check" today.  I've noticed my backend
servers tend to get flagged as DOWN a lot, especially when I first
start or reload haproxy.  Then usually, a few inter (or downinter)
seconds later, it gets flagged as up.  The backend server is
definitely not down during that time.

This does not look too good. :(

I suppose it's really not
haproxy itself, but either my own health-check script and/or xinetd
(which launches my health-check script) that might be causing a
problem.

Your xinetd scripts may be OK, as you are getting SOCKERR.

I don't know why it's doing this.  I do notice that whenever I do have
a backend server flagged as down, and I do a ps to look around, there
are a few instances of my health-check script running (or stalled or
whatever.)  After haproxy connects, it waits "timeout check" or
"inter" time for a response before giving up and calling that a
failure?

It waits "timeout check":
--- cut here ---
timeout check <timeout>
  Set additional check timeout, but only after a connection has been already
  established.

  May be used in sections:    defaults | frontend | listen | backend
                                 yes   |    no    |   yes  |   yes
  Arguments:
    <timeout> is the timeout value specified in milliseconds by default, but
              can be in any other unit if the number is suffixed by the unit,
              as explained at the top of this document.

  If set, haproxy uses min("timeout connect", "inter") as a connect timeout
  for check and "timeout check" as an additional read timeout. The "min" is
  used so that people running with *very* long "timeout connect" (eg. those
  who needed this due to the queue or tarpit) do not slow down their checks.
  Of course it is better to use "check queue" and "check tarpit" instead of
  long "timeout connect".

  If "timeout check" is not set haproxy uses "inter" for complete check
  timeout (connect + read) exactly like all <1.3.15 version.

  In most cases check request is much simpler and faster to handle than normal
  requests and people may want to kick out laggy servers so this timeout should
  be smaller than "timeout server".

  This parameter is specific to backends, but can be specified once for all in
  "defaults" sections. This is in fact one of the easiest solutions not to
  forget about it.

  See also: "timeout connect", "timeout queue", "timeout server",
            "timeout tarpit".
--- cut here ---
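
So, for example (just a sketch, not your actual configuration):

        defaults
            timeout connect 10s   # connect timeout for regular traffic
            timeout check   2s    # additional read timeout for checks, once connected

        backend LDAPFarm
            server LDAP1 BBBB:389 check inter 5s

here the connect timeout used for the check is min("timeout connect", "inter") = 5s, and once the connection is established haproxy waits up to another 2s ("timeout check") for the response.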

However, SOCKERR should not have anything to do with timeouts, so I'm afraid you may be working from a wrong assumption here.

But since it's launched from xinetd, even though haproxy
might close the connection after "timeout check" (or "inter") amount
of time, I think the health check script process continues to stick
around until it's done.

I was thinking I might try setting "fastinter 1s" and "timeout check
900" (milliseconds, I think, by default), and "fall 4".  So if, for
some reason, a check fails (my script, xinetd, backend server, etc
"stalls"), then it'll only wait 900ms.  Then it'll try again 1s later.
 I figure that within (900ms + 1s), it might be OK and respond back
properly (ignoring why it may have failed the first time.)  Not the
cleanest way, but if anyone has suggestions, I'd welcome them.
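
For what it's worth, combined with your current server lines that would look more or less like this (just a sketch based on the values you mention, untested):

        backend LDAPFarm
            timeout check 900    # milliseconds by default
            server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 4 rise 2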

I tried using 1.4dev5 rather than the stable 1.3.22.  I noticed 1.4dev5
shows more diagnostics in my /var/log/messages.

Indeed. Besides the HTTP keep-alive support, the 1.4 release is going to bring a lot of improvements in health checks and stats. We are regularly trying to make it easier to track and solve such problems, by making the messages more verbose and by providing additional information.

This is what I see when I do the -sf option.

<CUT>

Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP1 is DOWN,
reason: Socket error, check duration: 46ms. 1 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.

Jan  6 09:37:41 lbtest1 haproxy[21905]: Server LDAPFarm/LDAP2 is DOWN,
reason: Socket error, check duration: 41ms. 0 active and 0 backup
servers online. 0 sessions requeued, 0 total in queue.

OK. Obviously you are getting a HCHK_STATUS_SOCKERR condition here, which unfortunately is not very informative yet. There are three calls to set_server_check_status() with NULL as the additional info:

$ egrep -R "set_s.*HCHK_STATUS_SOCKERR.*NULL" src
src/checks.c:                   set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
src/checks.c:                                           set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);
src/checks.c:                                           set_server_check_status(s, HCHK_STATUS_SOCKERR, NULL);

Could you please try to change the second "NULL" to "strerror(errno)"?

I noticed you are using "addr" so that your health checks go to localhost instead of the servers themselves:

        server LDAP1 BBBB:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
        server LDAP2 CCCC:389 check addr localhost port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2

I think that could be the source of your problems.

I'll try to reproduce a similar condition in my environment, but before I am able to do that, would you please try to drop "addr localhost" for now and check if it makes any difference?
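
For example (keeping everything else on those lines exactly as it is):

        server LDAP1 BBBB:389 check port 9101 inter 5s fastinter 1s downinter 5s fall 2 rise 2
        server LDAP2 CCCC:389 check port 9102 inter 5s fastinter 1s downinter 5s fall 2 rise 2

Note that the checks would then connect to BBBB:9101 and CCCC:9102 directly, so treat this only as a test to see whether "addr localhost" plays a role in the problem.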

Best regards,

                                Krzysztof Olędzki
