Hi Lange,

Would it be possible to take a trace (tcpdump) of the health check? This may help as well.
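Something along these lines would isolate the check traffic; the interface name and the choice of backend host/port are only assumptions taken from the config you posted, so adjust them to the real environment:

```shell
# Capture full packets of the health-check exchange with one backend.
# "eth0", the opsapp1 hostname and port 41000 are assumptions from the
# posted config -- substitute the real interface and server.
tcpdump -i eth0 -n -s 0 -w healthcheck.pcap \
  'host opsapp1.ops.example.com and port 41000'
```

Reading the capture back afterwards (e.g. `tcpdump -r healthcheck.pcap -A`) should show whether the HEAD response actually arrives within the check window, or whether the connection stalls.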
Cheers

On Fri, May 25, 2012 at 4:01 AM, Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY] <kevin.m.la...@nasa.gov> wrote:
> Monsieur Tarreau,
>
> Actually, we are seeing frontend service availability flapping, this morning
> particularly. Missing from my snippet is the logic for an unplanned outage
> landing page, which our customers were seeing this morning, so haproxy
> truly is "timing out" and marking each backend as down until there are no
> backend servers available, throwing up the unplanned outage landing page.
>
> I'll send more logs and details when I analyze them later.
>
> Regards,
> Kevin Lange
>
> ----
> Kevin M Lange
> Mission Operations and Services
> NASA EOSDIS Evolution and Development
> Intelligence and Information Systems
> Raytheon Company
>
> +1 (301) 851-8450 (office)
> +1 (301) 807-2457 (cell)
> kevin.m.la...@nasa.gov
> kla...@raytheon.com
>
> 5700 Rivertech Court
> Riverdale, Maryland 20737
>
> ----- Reply message -----
> From: "Willy Tarreau" <w...@1wt.eu>
> Date: Thu, May 24, 2012 5:18 pm
> Subject: Problems with layer7 check timeout
> To: "Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY]" <kevin.m.la...@nasa.gov>
> Cc: "haproxy@formilux.org" <haproxy@formilux.org>
>
> Hi Kevin,
>
> On Thu, May 24, 2012 at 04:04:03PM -0500, Lange, Kevin M. (GSFC-423.0)[RAYTHEON COMPANY] wrote:
>> Hi,
>> We're having odd behavior (apparently we always have, but didn't realize it)
>> where our backend httpchks "time out":
>>
>> May 24 04:03:33 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 04:41:55 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:38:10 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 08:53:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:32:20 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:35:01 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:41:37 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops2 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 09:56:41 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops3 is DOWN, reason: Layer7 timeout, check duration: 1002ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>> May 24 10:01:45 opsslb1 haproxy[4594]: Server webapp_ops_bk/webapp_ops1 is DOWN, reason: Layer7 timeout, check duration: 1001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
>>
>> We've been playing with the timeout values, and we don't know what is
>> controlling the "Layer7 timeout, check duration: 1002ms". The backend
>> service availability check (by hand) typically takes 2-3 seconds on average.
>> Here is the relevant haproxy setup.
>>
>> #---------------------------------------------------------------------
>> # Global settings
>> #---------------------------------------------------------------------
>> global
>>     log-send-hostname opsslb1
>>     log 127.0.0.1 local1 info
>>     # chroot /var/lib/haproxy
>>     pidfile /var/run/haproxy.pid
>>     maxconn 1024
>>     user haproxy
>>     group haproxy
>>     daemon
>>
>> #---------------------------------------------------------------------
>> # common defaults that all the 'listen' and 'backend' sections will
>> # use if not designated in their block
>> #---------------------------------------------------------------------
>> defaults
>>     mode http
>>     log global
>>     option dontlognull
>>     option httpclose
>>     option httplog
>>     option forwardfor
>>     option redispatch
>>     timeout connect 500 # default 10 second time out if a backend is not found
>>     timeout client 50000
>>     timeout server 3600000
>>     maxconn 60000
>>     retries 3
>>
>> frontend webapp_ops_ft
>>     bind 10.0.40.209:80
>>     default_backend webapp_ops_bk
>>
>> backend webapp_ops_bk
>>     balance roundrobin
>>     option httpchk HEAD /app/availability
>>     reqrep ^Host:.* Host:\ webapp.example.com
>>     server webapp_ops1 opsapp1.ops.example.com:41000 check inter 30000
>>     server webapp_ops2 opsapp2.ops.example.com:41000 check inter 30000
>>     server webapp_ops3 opsapp3.ops.example.com:41000 check inter 30000
>>     timeout check 15000
>>     timeout connect 15000
>
> This is quite strange. The check timeout is defined first by "timeout check"
> or, if unset, by "inter". So in your case you should observe a 15-second
> timeout, not one second.
>
> What exact version is this? (haproxy -vv)
>
> It looks like a bug; however, it could be a bug in the timeout handling as
> well as in the reporting. I'd suspect the latter, since you say the
> service takes 2-3 sec to respond and you don't seem to see errors
> that often.
>
> Regards,
> Willy
>
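For readers following the thread, the selection rule Willy describes can be sketched in a minimal config fragment. The backend name, address and values below are illustrative only, not taken from Kevin's setup:

```
# Minimal sketch of how the layer7 check timeout is chosen (hypothetical
# backend; values are illustrative).
backend sketch_bk
    option httpchk HEAD /app/availability
    # With "timeout check" set, the check response must arrive within 15s
    # of the check connection being established:
    timeout check 15000
    # If "timeout check" were absent, the "inter" value (30s here) would
    # bound the check instead:
    server s1 10.0.0.1:41000 check inter 30000
```

Either way, nothing in such a setup explains a ~1000ms cutoff, which is why the report points at a bug in either the timeout handling or the duration reporting.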