Unfortunately, retry doesn't work in our case because we run haproxy in 2 layers: frontend servers and backend servers (the backend layer distributes traffic among multiple processes on each server). When an app on a server goes down, the haproxy on that server is still up and accepting connections, even though the layer 7 http checks from the frontend haproxy are failing. Because the backend haproxy still accepts the connection, the frontend haproxy never sees a connection failure, so the retry option never kicks in.
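
A minimal sketch of the frontend layer in a setup like this. The server names, addresses, ports, timeouts and the /health check URI are hypothetical and do not come from the actual config:

```
# Frontend-layer haproxy (illustrative only, all names and addresses assumed)
defaults
    mode http
    retries 3                 # retries apply to failed connection attempts
    option redispatch         # allow a retry to go to another server
    timeout connect 5s
    timeout client  30s
    timeout server  30s

backend app_servers
    option httpchk GET /health    # layer 7 check, hypothetical URI
    # Each address below is the backend-layer haproxy on that server,
    # not the app processes themselves.
    server srv1 10.0.0.11:8080 check inter 30s fastinter 1s
    server srv2 10.0.0.12:8080 check inter 30s fastinter 1s
```

In this sketch, a connect to 10.0.0.11:8080 succeeds whenever the backend-layer haproxy is running, whether or not the app behind it is alive, so retries and redispatch have nothing to act on. Only the httpchk result tells the frontend layer that the server should not receive traffic.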
-Patrick

------------------------------------------------------------------------
From: Baptiste <bed...@gmail.com>
Sent: 2014-02-24 07:18:00 E
To: Malcolm Turnbull <malc...@loadbalancer.org>
CC: Neil <n...@iamafreeman.com>, Patrick Hemmer <hapr...@stormcloud9.net>, HAProxy <haproxy@formilux.org>
Subject: Re: Just a simple thought on health checks after a soft reload of HAProxy....

> Hi Malcolm,
>
> Hence the retry and redispatch options :)
> I know it's a dirty workaround.
>
> Baptiste
>
>
> On Sun, Feb 23, 2014 at 8:42 PM, Malcolm Turnbull
> <malc...@loadbalancer.org> wrote:
>> Neil,
>>
>> Yes, peers are great for passing stick tables to the new HAProxy
>> instance and any current connections bound to the old process will be
>> fine.
>> However any new connections will hit the new HAProxy process and if
>> the backend server is down but haproxy hasn't health checked it yet
>> then the user will hit a failed server.
>>
>>
>>
>> On 23 February 2014 10:38, Neil <n...@iamafreeman.com> wrote:
>>> Hello
>>>
>>> Regarding restarts, rather that cold starts, if you configure peers the
>>> state from before the restart should be kept. The new process haproxy
>>> creates is automatically a peer to the existing process and gets the state
>>> as was.
>>>
>>> Neil
>>>
>>> On 23 Feb 2014 03:46, "Patrick Hemmer" <hapr...@stormcloud9.net> wrote:
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Sok Ann Yap <sok...@gmail.com>
>>>> Sent: 2014-02-21 05:11:48 E
>>>> To: haproxy@formilux.org
>>>> Subject: Re: Just a simple thought on health checks after a soft reload of
>>>> HAProxy....
>>>>
>>>> Patrick Hemmer <haproxy@...> writes:
>>>>
>>>> From: Willy Tarreau <w <at> 1wt.eu>
>>>>
>>>> Sent: 2014-01-25 05:45:11 E
>>>>
>>>> Till now that's exactly what's currently done. The servers are marked
>>>> "almost dead", so the first check gives the verdict. Initially we had
>>>> all checks started immediately. But it caused a lot of issues at several
>>>> places where there were a high number of backends or servers mapped to
>>>> the same hardware, because the rush of connection really caused the
>>>> servers to be flagged as down. So we started to spread the checks over
>>>> the longest check period in a farm.
>>>>
>>>> Is there a way to enable this behavior? In my
>>>> environment/configuration, it causes absolutely no issue that all
>>>> the checks be fired off at the same time.
>>>> As it is right now, when haproxy starts up, it takes it quite a
>>>> while to discover which servers are down.
>>>> -Patrick
>>>>
>>>> I faced the same problem in
>>>> http://thread.gmane.org/gmane.comp.web.haproxy/14644
>>>>
>>>> After much contemplation, I decided to just patch away the initial spread
>>>> check behavior:
>>>> https://github.com/sayap/sayap-overlay/blob/master/net-proxy/haproxy/files/haproxy-immediate-first-check.diff
>>>>
>>>>
>>>>
>>>> I definitely think there should be an option to disable the behavior. We
>>>> have an automated system which adds and removes servers from the config,
>>>> and then bounces haproxy. Every time haproxy is bounced, we have a period
>>>> where it can send traffic to a dead server.
>>>>
>>>>
>>>> There's also a related bug on this.
>>>> The bug is that when I have a config with "inter 30s fastinter 1s" and no
>>>> httpchk enabled, when haproxy first starts up, it spreads the checks over
>>>> the period defined as fastinter, but the stats output says "UP 1/3" for the
>>>> full 30 seconds. It also says "L4OK in 30001ms", when I know it doesn't
>>>> take the server 30 seconds to simply accept a connection.
>>>> Yet you get different behavior when using httpchk. When I add "option
>>>> httpchk", it still spreads the checks over the 1s fastinter value, but the
>>>> stats output goes full "UP" immediately after the check occurs, not "UP
>>>> 1/3". It also says "L7OK/200 in 0ms", which is what I expect to see.
>>>>
>>>> -Patrick
>>>>
>>
>>
>> --
>> Regards,
>>
>> Malcolm Turnbull.
>>
>> Loadbalancer.org Ltd.
>> Phone: +44 (0)870 443 8779
>> http://www.loadbalancer.org/
>>