Hi Lange,

Would it be possible to take a packet trace (tcpdump) of the health check?
That may help as well.
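If it helps, here is a minimal sketch of the kind of capture I mean. The
interface name (eth0) is an assumption; the backend host and port are taken
from your config and may need adjusting. It just prints the command, since
the capture itself needs root on the load balancer:

```shell
#!/bin/sh
# Sketch: capture the health-check exchange between haproxy and one backend.
# Assumptions: checks leave via eth0 toward opsapp1 on port 41000.
IFACE=eth0
BACKEND=opsapp1.ops.example.com
PORT=41000
CMD="tcpdump -i $IFACE -s 0 -w /tmp/healthcheck.pcap host $BACKEND and tcp port $PORT"
# Print the command rather than running it (needs root to capture):
echo "$CMD"
```

Reading the capture back with "tcpdump -r /tmp/healthcheck.pcap" should show
whether the HEAD request goes out and how long the backend takes to answer,
relative to the check timeout.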

Cheers


On Fri, May 25, 2012 at 4:01 AM, Lange, Kevin M. (GSFC-423.0)[RAYTHEON
COMPANY] <kevin.m.la...@nasa.gov> wrote:
> Monsieur Tarreau,
>
> Actually, we are seeing frontend service availability flapping, particularly
> this morning. Missing from my snippet is the logic for an unplanned-outage
> landing page, which our customers were seeing this morning; haproxy truly is
> "timing out" and marking each backend as down until no backend servers are
> available, at which point it throws up the unplanned-outage landing page.
>
> I'll send more logs and details when I analyze later.
>
> Regards,
> Kevin Lange
>
>
> ----
> Kevin M Lange
> Mission Operations and Services
> NASA EOSDIS Evolution and Development
> Intelligence and Information Systems
> Raytheon Company
>
> +1 (301) 851-8450 (office)
> +1 (301) 807-2457 (cell)
> kevin.m.la...@nasa.gov
> kla...@raytheon.com
>
> 5700 Rivertech Court
> Riverdale, Maryland 20737
>
> ----- Reply message -----
> From: "Willy Tarreau" <w...@1wt.eu>
> Date: Thu, May 24, 2012 5:18 pm
> Subject: Problems with layer7 check timeout
> To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]"
> <kevin.m.la...@nasa.gov>
> Cc: "haproxy@formilux.org" <haproxy@formilux.org>
>
> Hi Kevin,
>
> On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M.
> (GSFC-423.0)[RAYTHEON COMPANY] wrote:
>> Hi,
>> We're having odd behavior (apparently have always but didn't realize it),
>> where our backend httpchks "time out":
>>
>> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is
>> DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is
>> DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup
>> servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>
>>
>> We've been playing with the timeout values, and we don't know what is
>> controlling the "Layer7 timeout, check duration: 1002ms". The backend
>> service availability check (run by hand) takes 2-3 seconds on average.
>> Here is the relevant haproxy setup.
>>
>> #---------------------------------------------------------------------
>> # Global settings
>> #---------------------------------------------------------------------
>> global
>>     log-send-hostname opsslb1
>>     log         127.0.0.1 local1 info
>> #    chroot      /var/lib/haproxy
>>     pidfile     /var/run/haproxy.pid
>>     maxconn     1024
>>     user        haproxy
>>     group       haproxy
>>     daemon
>>
>> #---------------------------------------------------------------------
>> # common defaults that all the 'listen' and 'backend' sections will
>> # use if not designated in their block
>> #---------------------------------------------------------------------
>> defaults
>>     mode        http
>>     log         global
>>     option      dontlognull
>>     option      httpclose
>>     option      httplog
>>     option      forwardfor
>>     option      redispatch
>>     timeout connect 500 # default: 10-second timeout if a backend is not found
>>     timeout client 50000
>>     timeout server 3600000
>>     maxconn     60000
>>     retries     3
>>
>> frontend webapp_ops_ft
>>
>>         bind 10.0.40.209:80
>>         default_backend webapp_ops_bk
>>
>> backend webapp_ops_bk
>>         balance roundrobin
>>         option httpchk HEAD /app/availability
>>         reqrep ^Host:.* Host:\ webapp.example.com
>>         server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
>>         server webapp_ops2 opsapp2.ops.example.com:41000 check inter 30000
>>         server webapp_ops3 opsapp3.ops.example.com:41000 check inter 30000
>>         timeout check 15000
>>         timeout connect 15000
>
> This is quite strange. The timeout is defined first by "timeout check" or,
> if unset, by "inter". So in your case you should observe a 15-second
> timeout, not a one-second one.
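> To illustrate, a sketch reusing your own directives (timings in
> milliseconds); the comments are my reading of how the timeout is chosen:
>
> ```
> backend webapp_ops_bk
>     option httpchk HEAD /app/availability
>     timeout check 15000   # check must complete within 15s once connected
>     server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
>     # without "timeout check", the check timeout would fall back to
>     # "inter" (30s here) -- neither of which explains a ~1s timeout
> ```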
>
> What exact version is this? (haproxy -vv)
>
> It looks like a bug, however it could be a bug in the timeout handling as
> well as in the reporting. I'd suspect the latter since you're saying that
> the service takes 2-3 sec to respond and you don't seem to see errors
> that often.
>
> Regards,
> Willy
>
