Re: Problems with layer7 check timeout

2012-05-29 Thread Willy Tarreau
Hi Kevin,

On Tue, May 29, 2012 at 03:08:17PM -0400, Kevin M Lange wrote:
> I've been monitoring our service availability check (http head of a 
> resource that truly provides availability status of the application).  
> Under normal circumstances, the check takes 2-3 seconds.  We found 
> periods of time where the application would take 15+seconds and fail (I 
> did not capture HTTP code, but I'm pretty sure it was a 500 series from 
> what I've been looking through).  These failure periods match the times 
> where haproxy was indicating timeouts of 1002ms.  So, it looks like 
> haproxy is doing its job.  Is this then a bug in the logging of the 
> timeout value (reporting 1002ms vs 15000+ms)?

This is the strange part, as I didn't manage to reproduce this indication on
my test platform. Would you be willing to send me, in private, the network
capture for a series of checks that were mis-reported? Depending on how
it's segmented and aborted, maybe I could get a clue about what is
happening.

> We haven't had any problems since 25 May, but we're keeping watch.

It reminds me of the old days of early Opterons, where the clock was unsynced
between the cores and jumped back and forth, causing early timeouts
and wrong timer reports. The issue is coming back with the use of VMs
everywhere. This led me to implement the internal monotonic clock, which
compensates for jumps so that they cannot exceed 1s now. But even with a 1s
jump, this does not explain 15000 -> 1002ms, so right now I'm a bit stuck.

Regards,
Willy




Re: Problems with layer7 check timeout

2012-05-29 Thread Kevin M Lange
I've been monitoring our service availability check (http head of a 
resource that truly provides availability status of the application).  
Under normal circumstances, the check takes 2-3 seconds.  We found 
periods of time where the application would take 15+seconds and fail (I 
did not capture HTTP code, but I'm pretty sure it was a 500 series from 
what I've been looking through).  These failure periods match the times 
where haproxy was indicating timeouts of 1002ms.  So, it looks like 
haproxy is doing its job.  Is this then a bug in the logging of the 
timeout value (reporting 1002ms vs 15000+ms)?


We haven't had any problems since 25 May, but we're keeping watch.

- Kevin

On 5/25/12 11:18 AM, Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY] wrote:

Willy,
I'll try the patch, but not until next week because of the holiday 
weekend.  I don't want to make a significant change that I would have 
to support over the long weekend.
I'm capturing tcpdump between SLB and the three backends.  I'd like to 
have a capture during an "outage".  I expect to see something today, 
and I'll send to you.

- Kevin


On May 25, 2012, at 2:12 AM, Willy Tarreau wrote:


Hi again Kevin,

Well, I suspect that there might be a corner case with the bug I fixed
which might have caused what you observed.

The "timeout connect" is computed from the last expire date. Since
"timeout check" was added upon connection establishment but the task
was woken too late, after a first check failure is reported too late
the next check timeout can be shortened.

It's still unclear to me how it is possible that the check timeout is
reported this small, considering that it's updated once the connect
succeeds. But performing computations in the past is never a good way
to have something reliable.

Could you please apply the attached fix for the bug I mentioned in the
previous mail, to see if the issue is still present? After all, I
would not be totally surprised if this bug had nasty side effects
like this.

Thanks,
Willy

<0001-BUG-MINOR-checks-expire-on-timeout.check-if-smaller-.patch>


Kevin Lange
kevin.m.la...@nasa.gov 
kla...@raytheon.com 
W: +1 (301) 851-8450
Raytheon  | NASA  | ECS Evolution Development Program
https://www.echo.com  | https://www.raytheon.com





Re: Problems with layer7 check timeout

2012-05-25 Thread Willy Tarreau
On Fri, May 25, 2012 at 10:18:20AM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> Willy,
> I'll try the patch, but not until next week because of the holiday weekend.
> I don't want to make a significant change that I would have to support over
> the long weekend.

This sounds reasonable to anyone who has a little bit of experience with
production systems :-)

> I'm capturing tcpdump between SLB and the three backends.  I'd like to have a
> capture during an "outage".  I expect to see something today, and I'll send
> to you.

Fine, that's very much appreciated!

Regards,
Willy




Re: Problems with layer7 check timeout

2012-05-25 Thread Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]
Willy,
I'll try the patch, but not until next week because of the holiday weekend.  I 
don't want to make a significant change that I would have to support over the 
long weekend.
I'm capturing tcpdump between SLB and the three backends.  I'd like to have a 
capture during an "outage".  I expect to see something today, and I'll send to 
you.
- Kevin


On May 25, 2012, at 2:12 AM, Willy Tarreau wrote:

> Hi again Kevin,
> 
> Well, I suspect that there might be a corner case with the bug I fixed
> which might have caused what you observed.
> 
> The "timeout connect" is computed from the last expire date. Since
> "timeout check" was added upon connection establishment but the task
> was woken too late, after a first check failure is reported too late
> the next check timeout can be shortened.
> 
> It's still unclear to me how it is possible that the check timeout is
> reported this small, considering that it's updated once the connect
> succeeds. But performing computations in the past is never a good way
> to have something reliable.
> 
> Could you please apply the attached fix for the bug I mentioned in the
> previous mail, to see if the issue is still present? After all, I
> would not be totally surprised if this bug had nasty side effects
> like this.
> 
> Thanks,
> Willy
> 
> <0001-BUG-MINOR-checks-expire-on-timeout.check-if-smaller-.patch>

Kevin Lange
kevin.m.la...@nasa.gov
kla...@raytheon.com
W: +1 (301) 851-8450
Raytheon  | NASA  | ECS Evolution Development Program
https://www.echo.com  | https://www.raytheon.com



smime.p7s
Description: S/MIME cryptographic signature



Re: Problems with layer7 check timeout

2012-05-24 Thread Willy Tarreau
Hi again Kevin,

Well, I suspect that there might be a corner case with the bug I fixed
which might have caused what you observed.

The "timeout connect" is computed from the last expire date. Since
"timeout check" was added upon connection establishment but the task
was woken too late, after a first check failure is reported too late
the next check timeout can be shortened.

It's still unclear to me how it is possible that the check timeout is
reported this small, considering that it's updated once the connect
succeeds. But performing computations in the past is never a good way
to have something reliable.

Could you please apply the attached fix for the bug I mentioned in the
previous mail, to see if the issue is still present? After all, I
would not be totally surprised if this bug had nasty side effects
like this.

Thanks,
Willy
 
From 78604116c3cbe23987bef94cb0d7aa15e6d4371b Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Fri, 25 May 2012 07:41:38 +0200
Subject: [PATCH] BUG/MINOR: checks: expire on timeout.check if smaller than 
timeout.connect

It happens that haproxy doesn't displace the task in the wait queue when
validating a connection, so if the check timeout is set to a smaller value
than timeout.connect, it will not strike before timeout.connect.

The bug is present at least in 1.4.15..1.4.21, so the fix must be backported.
(cherry picked from commit 1e44a49c8973f08ee1e35d8737f4677db11cf7ab)
---
 src/checks.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/src/checks.c b/src/checks.c
index 7255817..0aa65c0 100644
--- a/src/checks.c
+++ b/src/checks.c
@@ -781,8 +781,10 @@ static int event_srv_chk_w(int fd)
 		ret = send(fd, check_req, check_len, MSG_DONTWAIT | MSG_NOSIGNAL);
 		if (ret == check_len) {
 			/* we allow up to <timeout.check> if nonzero for a responce */
-			if (s->proxy->timeout.check)
+			if (s->proxy->timeout.check) {
 				t->expire = tick_add_ifset(now_ms, s->proxy->timeout.check);
+				task_queue(t);
+			}
 			EV_FD_SET(fd, DIR_RD);   /* prepare for reading reply */
 			goto out_nowake;
 		}
-- 
1.7.2.1.45.g54fbc



Re: Problems with layer7 check timeout

2012-05-24 Thread Willy Tarreau
Hi Kevin,

On Thu, May 24, 2012 at 09:01:43PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> Monsieur Tarreau,
> 
> Actually, we are seeing frontend service availability flapping. This morning
> particularly.  Missing from my snippet is the logic for an unplanned outage
> landing page, which our customers were seeing this morning, so haproxy
> truly is "timing out" and marking each backend as down until there are no
> backend servers available, throwing up the unplanned outage landing page.

I'm not surprised, since your checks need more than 1 second
to work correctly :-/

I have tested your configuration here. First, I can say it's not bad reporting,
as the timers in the logs are the correct ones. Second, I noticed a bug: when
both "timeout check" and "timeout connect" are set, haproxy uses the larger
of the two for the check! The reason is that it did not displace the task in
the queue upon connection establishment if the check timeout is smaller than
the connect timeout. I've fixed it now.

This aside, I could not reproduce the issue here. I agree with Baptiste, a
tcpdump would certainly help. Maybe we're facing corner case issues which
I did not test, such as servers sending partial responses or things like
this, I don't know.

Regards,
Willy




Re: Problems with layer7 check timeout

2012-05-24 Thread Baptiste
Hi Lange,

Would it be possible to take a trace (tcpdump) of the health check?
This may help as well.

Cheers


On Fri, May 25, 2012 at 4:01 AM, Lange, Kevin M. (GSFC-423.0)[RAYTHEON
COMPANY]  wrote:
> Monsieur Tarreau,
>
> Actually, we are seeing frontend service availability flapping. This morning
> particularly.  Missing from my snippet is the logic for an unplanned outage
> landing page, which our customers were seeing this morning, so haproxy
> truly is "timing out" and marking each backend as down until there are no
> backend servers available, throwing up the unplanned outage landing page.
>
> I'll send more logs and details when I analyze later.
>
> Regards,
> Kevin Lange
>
>
> 
> Kevin M Lange
> Mission Operations and Services
> NASA EOSDIS Evolution and Development
> Intelligence and Information Systems
> Raytheon Company
>
> +1 (301) 851-8450 (office)
> +1 (301) 807-2457 (cell)
> kevin.m.la...@nasa.gov
> kla...@raytheon.com
>
> 5700 Rivertech Court
> Riverdale, Maryland 20737
>
> - Reply message -
> From: "Willy Tarreau" 
> Date: Thu, May 24, 2012 5:18 pm
> Subject: Problems with layer7 check timeout
> To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]"
> 
> Cc: "haproxy@formilux.org" 
>
> Hi Kevin,
>
> On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M.
> (GSFC-423.0)[RAYTHEON COMPANY] wrote:
>> Hi,
>> We're having odd behavior (apparently have always but didn't realize it),
>> where our backend httpchks "time out":
>>
>> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>
>>
>> We've been playing with the timeout values, and we don't know what is
>> controlling the "Layer7 timeout, check duration: 1002ms".  The backend
>> service availability check (by hand) typically takes 2-3 seconds on average.
>> Here is the relevant haproxy setup.
>>
>> #-
>> # Global settings
>> #-
>> global
>> log-send-hostname opsslb1
>> log 127.0.0.1 local1 info
>> #    chroot  /var/lib/haproxy
>> pidfile /var/run/haproxy.pid
>> maxconn 1024
>> user    haproxy
>> group   haproxy
>> daemon
>>
>> #-
>> # common defaults that all the 'listen' and 'backend' sections will
>> # use if not designated in their block
>> #-
>> defaults
>> mode    http
>> log global
>> option  dontlognull
>> option  httpclose
>> option  httplog
>> option  forwardfor
>> option  redispatch
>> timeout connect 500 # default 10 second time out if a backend is not found
>> timeout client 5
>> timeout server 360
>> maxconn 6
>> retries 3
>>
>> frontend webapp_ops_ft
>>
>> bind 10.0.40.209:80
>> default_backend webapp_ops_bk
>>
>> backend web

Re: Problems with layer7 check timeout

2012-05-24 Thread Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]
Monsieur Tarreau,

Actually, we are seeing frontend service availability flapping. This morning 
particularly.  Missing from my snippet is the logic for an unplanned outage 
landing page, which our customers were seeing this morning, so haproxy truly 
is "timing out" and marking each backend as down until there are no backend 
servers available, throwing up the unplanned outage landing page.

I'll send more logs and details when I analyze later.

Regards,
Kevin Lange


Kevin M Lange
Mission Operations and Services
NASA EOSDIS Evolution and Development
Intelligence and Information Systems
Raytheon Company

+1 (301) 851-8450 (office)
+1 (301) 807-2457 (cell)
kevin.m.la...@nasa.gov
kla...@raytheon.com

5700 Rivertech Court
Riverdale, Maryland 20737

- Reply message -
From: "Willy Tarreau" 
Date: Thu, May 24, 2012 5:18 pm
Subject: Problems with layer7 check timeout
To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]" 
Cc: "haproxy@formilux.org" 

Hi Kevin,

On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> Hi,
> We're having odd behavior (apparently have always but didn't realize it), 
> where our backend httpchks "time out":
>
> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>
>
> We've been playing with the timeout values, and we don't know what is 
> controlling the "Layer7 timeout, check duration: 1002ms".  The backend 
> service availability check (by hand) typically takes 2-3 seconds on average.
> Here is the relevant haproxy setup.
>
> #-
> # Global settings
> #-
> global
> log-send-hostname opsslb1
> log 127.0.0.1 local1 info
> #    chroot  /var/lib/haproxy
> pidfile /var/run/haproxy.pid
> maxconn 1024
> user    haproxy
> group   haproxy
> daemon
>
> #-
> # common defaults that all the 'listen' and 'backend' sections will
> # use if not designated in their block
> #-
> defaults
> mode    http
> log global
> option  dontlognull
> option  httpclose
> option  httplog
> option  forwardfor
> option  redispatch
> timeout connect 500 # default 10 second time out if a backend is not found
> timeout client 5
> timeout server 360
> maxconn 6
> retries 3
>
> frontend webapp_ops_ft
>
> bind 10.0.40.209:80
> default_backend webapp_ops_bk
>
> backend webapp_ops_bk
> balance roundrobin
> option httpchk HEAD /app/availability
> reqrep ^Host:.* Host:\ webapp.example.com
> server webapp_ops1 opsapp1.ops.example.com:41000 check inter 3
> server webapp_ops2 opsapp2.ops.example.com:41000 check inter 3
> server webapp_ops3 op

Re: Problems with layer7 check timeout

2012-05-24 Thread Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]
I've already put an upgrade to haproxy in place.


Kevin M Lange
Mission Operations and Services
NASA EOSDIS Evolution and Development
Intelligence and Information Systems
Raytheon Company

+1 (301) 851-8450 (office)
+1 (301) 807-2457 (cell)
kevin.m.la...@nasa.gov
kla...@raytheon.com

5700 Rivertech Court
Riverdale, Maryland 20737

- Reply message -
From: "Willy Tarreau" 
Date: Thu, May 24, 2012 5:59 pm
Subject: Problems with layer7 check timeout
To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]" 
Cc: "haproxy@formilux.org" 

On Thu, May 24, 2012 at 04:31:39PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> I thought it was a bug in the reporting, considering we've played with 
> numerous values for the various timeouts as an experiment, but wanted your 
> thoughts.
> This is v1.4.15.
>
>  [root@opsslb1 log]# haproxy -v
> HA-Proxy version 1.4.15 2011/04/08
> Copyright 2000-2010 Willy Tarreau 

OK, I'll try to reproduce. There have been a number of fixes since 1.4.15
BTW, but none of them look like what you observe. Still it would be
reasonable to consider an upgrade to 1.4.21.

Regards,
Willy



Re: Problems with layer7 check timeout

2012-05-24 Thread Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]
Err...more precisely...
HA-Proxy version 1.4.15 2011/04/08
Copyright 2000-2010 Willy Tarreau 

Build options :
  TARGET  = linux26
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing
  OPTIONS = USE_REGPARM=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes

Available polling systems :
 sepoll : pref=400,  test result OK
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 4 (4 usable), will use sepoll.

On May 24, 2012, at 5:31 PM, Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY] 
wrote:

> 
> 
> I thought it was a bug in the reporting, considering we've played with 
> numerous values for the various timeouts as an experiment, but wanted your 
> thoughts.
> This is v1.4.15.
> 
> [root@opsslb1 log]# haproxy -v
> HA-Proxy version 1.4.15 2011/04/08
> Copyright 2000-2010 Willy Tarreau 
> 
> On May 24, 2012, at 5:17 PM, Willy Tarreau wrote:
> 
>> Hi Kevin,
>> 
>> On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. 
>> (GSFC-423.0)[RAYTHEON COMPANY] wrote:
>>> Hi,
>>> We're having odd behavior (apparently have always but didn't realize it), 
>>> where our backend httpchks "time out":
>>> 
>>> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>>> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>>> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>> 
>>> 
>>> We've been playing with the timeout values, and we don't know what is 
>>> controlling the "Layer7 timeout, check duration: 1002ms".  The backend 
>>> service availability check (by hand) typically takes 2-3 seconds on average.
>>> Here is the relevant haproxy setup.
>>> 
>>> #-
>>> # Global settings
>>> #-
>>> global
>>>   log-send-hostname opsslb1
>>>   log 127.0.0.1 local1 info
>>> #chroot  /var/lib/haproxy
>>>   pidfile /var/run/haproxy.pid
>>>   maxconn 1024
>>>   user    haproxy
>>>   group   haproxy
>>>   daemon
>>> 
>>> #-
>>> # common defaults that all the 'listen' and 'backend' sections will
>>> # use if not designated in their block
>>> #-
>>> defaults
>>>   mode    http
>>>   log global
>>>   option  dontlognull
>>>   option  httpclose
>>>   option  httplog
>>>   option  forwardfor
>>>   option  redispatch
>>>   timeout connect 500 # default 10 second time out if a backend is not found
>>>   timeout client 5
>>>   timeout server 360
>>>   maxconn 6
>>>   retries 3
>>> 
>>> frontend webapp_ops_ft
>>> 
>>>   bind 10.0.40.209:80
>>>   default_backend webapp_ops_bk
>>> 
>>> backend webapp_ops_bk
>>>   balance roundrobin
>>>   option httpchk HEAD /app/availability
>>>   reqrep ^Host:.* 

Re: Problems with layer7 check timeout

2012-05-24 Thread Willy Tarreau
On Thu, May 24, 2012 at 04:31:39PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> I thought it was a bug in the reporting, considering we've played with 
> numerous values for the various timeouts as an experiment, but wanted your 
> thoughts.
> This is v1.4.15.
> 
>  [root@opsslb1 log]# haproxy -v
> HA-Proxy version 1.4.15 2011/04/08
> Copyright 2000-2010 Willy Tarreau 

OK, I'll try to reproduce. There have been a number of fixes since 1.4.15
BTW, but none of them look like what you observe. Still it would be
reasonable to consider an upgrade to 1.4.21.

Regards,
Willy




Re: Problems with layer7 check timeout

2012-05-24 Thread Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]
I thought it was a bug in the reporting, considering we've played with numerous 
values for the various timeouts as an experiment, but wanted your thoughts.
This is v1.4.15.

 [root@opsslb1 log]# haproxy -v
HA-Proxy version 1.4.15 2011/04/08
Copyright 2000-2010 Willy Tarreau 

On May 24, 2012, at 5:17 PM, Willy Tarreau wrote:

> Hi Kevin,
> 
> On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. 
> (GSFC-423.0)[RAYTHEON COMPANY] wrote:
>> Hi,
>> We're having odd behavior (apparently have always but didn't realize it), 
>> where our backend httpchks "time out":
>> 
>> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> 
>> 
>> We've been playing with the timeout values, and we don't know what is 
>> controlling the "Layer7 timeout, check duration: 1002ms".  The backend 
>> service availability check (by hand) typically takes 2-3 seconds on average.
>> Here is the relevant haproxy setup.
>> 
>> #---------------------------------------------------------------------
>> # Global settings
>> #---------------------------------------------------------------------
>> global
>>     log-send-hostname opsslb1
>>     log 127.0.0.1 local1 info
>>     #chroot  /var/lib/haproxy
>>     pidfile /var/run/haproxy.pid
>>     maxconn 1024
>>     user    haproxy
>>     group   haproxy
>>     daemon
>> 
>> #---------------------------------------------------------------------
>> # common defaults that all the 'listen' and 'backend' sections will
>> # use if not designated in their block
>> #---------------------------------------------------------------------
>> defaults
>>     mode    http
>>     log     global
>>     option  dontlognull
>>     option  httpclose
>>     option  httplog
>>     option  forwardfor
>>     option  redispatch
>>     timeout connect 500  # 500ms connect timeout if a backend is not reachable
>>     timeout client 5
>>     timeout server 360
>>     maxconn 6
>>     retries 3
>> 
>> frontend webapp_ops_ft
>> 
>>     bind 10.0.40.209:80
>>     default_backend webapp_ops_bk
>> 
>> backend webapp_ops_bk
>>     balance roundrobin
>>     option httpchk HEAD /app/availability
>>     reqrep ^Host:.* Host:\ webapp.example.com
>>     server webapp_ops1 opsapp1.ops.example.com:41000 check inter 3
>>     server webapp_ops2 opsapp2.ops.example.com:41000 check inter 3
>>     server webapp_ops3 opsapp3.ops.example.com:41000 check inter 3
>>     timeout check 15000
>>     timeout connect 15000
> 
> This is quite strange. The timeout is defined first by "timeout check" or,
> if unset, by "inter". So in your case you should observe a 15-second
> timeout, not a one-second one.
> 
> What exact version is this? (haproxy -vv)
> 
> It looks like a bug; however, it could be a bug either in the timeout
> handling or in the reporting. I'd suspect the latter, since you say the
> service takes 2-3 seconds to respond and you don't seem to see errors
> that often.
> 

Re: Problems with layer7 check timeout

2012-05-24 Thread Willy Tarreau
Hi Kevin,

On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON 
COMPANY] wrote:
> Hi,
> We're having odd behavior (apparently have always but didn't realize it), 
> where our backend httpchks "time out":
> 
> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is 
> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is 
> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup 
> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
> 
> 
> We've been playing with the timeout values, and we don't know what is 
> controlling the "Layer7 timeout, check duration: 1002ms".  The backend 
> service availability check (by hand) typically takes 2-3 seconds on average.
> Here is the relevant haproxy setup.
> 
> #---------------------------------------------------------------------
> # Global settings
> #---------------------------------------------------------------------
> global
>     log-send-hostname opsslb1
>     log 127.0.0.1 local1 info
>     #chroot  /var/lib/haproxy
>     pidfile /var/run/haproxy.pid
>     maxconn 1024
>     user    haproxy
>     group   haproxy
>     daemon
> 
> #---------------------------------------------------------------------
> # common defaults that all the 'listen' and 'backend' sections will
> # use if not designated in their block
> #---------------------------------------------------------------------
> defaults
>     mode    http
>     log     global
>     option  dontlognull
>     option  httpclose
>     option  httplog
>     option  forwardfor
>     option  redispatch
>     timeout connect 500  # 500ms connect timeout if a backend is not reachable
>     timeout client 5
>     timeout server 360
>     maxconn 6
>     retries 3
> 
> frontend webapp_ops_ft
> 
>     bind 10.0.40.209:80
>     default_backend webapp_ops_bk
> 
> backend webapp_ops_bk
>     balance roundrobin
>     option httpchk HEAD /app/availability
>     reqrep ^Host:.* Host:\ webapp.example.com
>     server webapp_ops1 opsapp1.ops.example.com:41000 check inter 3
>     server webapp_ops2 opsapp2.ops.example.com:41000 check inter 3
>     server webapp_ops3 opsapp3.ops.example.com:41000 check inter 3
>     timeout check 15000
>     timeout connect 15000

This is quite strange. The timeout is defined first by "timeout check" or,
if unset, by "inter". So in your case you should observe a 15-second
timeout, not a one-second one.
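That precedence can be made explicit in the backend itself. A minimal sketch,
reusing the names from the config quoted above; the "inter 30000" value is
illustrative only, not taken from this thread:

```haproxy
backend webapp_ops_bk
    option httpchk HEAD /app/availability
    # "timeout check" bounds how long an established check may take to
    # respond; when it is unset, "inter" serves as the check timeout.
    timeout check 15000
    # illustrative: probe every 30s; with "timeout check" set, "inter"
    # only controls the probing interval, not the timeout
    server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
```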

What exact version is this? (haproxy -vv)

It looks like a bug; however, it could be a bug either in the timeout
handling or in the reporting. I'd suspect the latter, since you say the
service takes 2-3 seconds to respond and you don't seem to see errors
that often.
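Since the diagnosis hinges on whether the check really takes 2-3 seconds,
here is one way to reproduce the "by hand" timing in a self-contained form.
A throwaway local server stands in for the real backend (the
opsapp*.ops.example.com:41000 servers in the config above); the path is the
one from "option httpchk":

```python
# Time an availability check "by hand", mimicking HAProxy's
# "option httpchk HEAD /app/availability". A throwaway local server
# stands in for the real backend so the sketch is self-contained.
import http.client
import http.server
import threading
import time

class AvailabilityHandler(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        time.sleep(0.2)          # simulate a slow availability computation
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass                     # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), AvailabilityHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

start = time.monotonic()
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=15)
conn.request("HEAD", "/app/availability")
status = conn.getresponse().status
duration_ms = (time.monotonic() - start) * 1000
conn.close()
server.shutdown()

# Against the real backend, duration_ms should land in the 2000-3000 ms
# range if the 2-3 s figure holds; here the stub makes it roughly 200 ms.
print(status, round(duration_ms))
```

Comparing this measured duration with the "check duration" field in the
haproxy logs would show whether the 1001-1002ms figure is a timing bug or
a reporting one.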

Regards,
Willy