Re: [PATCH 5/5] dynamic health check

Simon Horman Mon, 04 Feb 2013 19:06:02 -0800

On Fri, Feb 01, 2013 at 08:22:24AM +0100, Willy Tarreau wrote:
> Hi Simon,
> 
> On Fri, Feb 01, 2013 at 01:56:01PM +0900, Simon Horman wrote:
> > Hi Malcolm, Hi Willy,
> > 
> > after a bit of a hiatus I'd like to restart this discussion.
> 
> Cool, I wanted to ping you on this last week-end but forgot to do so !
> 
> > On Mon, Dec 24, 2012 at 10:23:15AM +0100, Willy Tarreau wrote:
> > > Hi Malcolm,
> > > 
> > > On Mon, Dec 24, 2012 at 09:06:25AM +0000, Malcolm Turnbull wrote:
> > > > Willy / Simon,
> > > > 
> > > > I'm very happy to add a down option, my original thought was that you
> > > > would use the standard health checks as well as the dynamic agent for
> > > > changing the weight.
> > > 
> > > That's what I thought I initially understood from our discussion a few
> > > months ago but then your post of the specs last week slightly confused
> > > me as I understood you needed this as a dedicated check. I think it was
> > > the same for Simon.
> > 
> > Sorry, I think that the problem here lies in my understanding of what is
> > desired.
> 
> No problem, several of us got confused.
> 
> > > > As you may for example want a specific HAproxy SMTP health check + use
> > > > the dynamic weighting agent.
> > > 
> > > Exactly. But then we have two options :
> > >   - retrieve the information from the checked port (easy for HTTP or TCP)
> > >   - retrieve the information from a dedicated port => this involves a
> > >     second task to do this, with its own check intervals.
> > > 
> > > The latter doesn't seem stupid at all, quite the opposite in fact, but
> > > it will require more settings on the server line. However it comes with
> > > a benefit, it is that when the agent returns "disable", checks are
> > > disabled on the real port, but then we could have the agent continue to
> > > be checked and later return a valid result again.
> > >
> > > > I'm not sure if that would cause some coding issues if the health
> > > > checks say 'Down' and the agent says 50%? (I would assume haproxy
> > > > health checks take priority?)
> > > 
> > > Status and weights are orthogonal. The real check should have precedence.
> > > 
> > > > Or if the agent says Down but the HAProxy health check says up?
> > > 
> > > I think it should be ANDed. This could help provide a first implementation
> > > of multi-port checks after all.
> > 
> > That sounds reasonable.
> > 
> > > > I'm certainly happy for Down to be added as an option with a
> > > > description string.
> > > > Also I'm assuming that later (the dynamic agent) could easily be
> > > > extended to an http style get check rather than TCP (lb-agent-chk)  if
> > > > users prefer to write an HTTP server application to integrate with it
> > > > (Kemp and Barracuda support this method).
> > 
> > On the topic of down: I think that Willy's proposal is
> > entirely reasonable. However, it's unclear to me whether disable should
> > also be supported or not.
> 
> The disable mode is very problematic : if a server accidentally returns it,
> there is no way to roll back except a manual intervention on the load
> balancers. Also there is a high risk that the backup LB will be forgotten
> in such an operation. I have no technical worries here, just operational
> ones. If we run agent checks on a dedicated port in parallel to health
> checks, this is different, because we could ensure that such checks could
> still be running when the server is disabled so that the agent can change
> the mode again. So maybe a first version should not support disable and a
> later one could support it ?


This seems reasonable to me.
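
To check my understanding of the semantics discussed above (status is ANDed
between the health check and the agent, while weight is orthogonal), here is
a rough sketch. It is only an illustration in Python, not HAProxy code, and
the names `ServerState`, `is_usable` and `effective_weight` are made up for
this mail.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical per-server state; field names are illustrative only."""
    health_up: bool = True   # result of the regular health check
    agent_up: bool = True    # last status reported by the agent
    weight_pct: int = 100    # last weight reported by the agent, in percent

    def is_usable(self) -> bool:
        # Status is ANDed: the server is up only if both checks agree.
        return self.health_up and self.agent_up

    def effective_weight(self) -> int:
        # Weight is orthogonal to status: it is remembered while the server
        # is down and only applies once the server is usable again.
        return self.weight_pct if self.is_usable() else 0
```

With this model a health check reporting down while the agent reports 50%
gives an unusable server, and the 50% takes effect as soon as the health
check recovers.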

> Also, I believe that in another thread we discussed supporting a
> new status (eg: STOPPED) which differs from DOWN in that it means the
> service was intentionally stopped and did not crash. We can't support
> this well right now (just map it to down) but I think it's important
> that people can design their agents for this. Similarly, a "FAIL"
> status could be useful in the usual situations where a server is inoperative
> due to external conditions but could appear valid. The common example is
> the mail server which fails to receive e-mails because the FS is full.
> Everything works except the service cannot be delivered. There is nothing
> to restart, the issue can go away by itself, etc... We'd map this to DOWN
> again, but I think some users may later prefer to have a dedicated status
> in the agent's language. So we should probably plan it in the language in
> order to avoid ugly patches here and there.

Adding stopped and fail, and mapping them both to down seems reasonable to me.
I assume that they also accept reason strings as down does.
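
Roughly, I imagine the parsing side looking something like the sketch below.
This is Python for illustration only, not the patch itself. The reply format
(a keyword or a percentage, optionally followed by a space and a reason
string) is my assumption, and `AgentResult` and `parse_agent_reply` are
hypothetical names.

```python
from typing import NamedTuple, Optional


class AgentResult(NamedTuple):
    status: Optional[str]      # "down", or None when only a weight was sent
    weight_pct: Optional[int]  # weight in percent, or None for a status reply
    keyword: str               # word the agent actually sent, kept for later use
    reason: str                # optional free-form reason string


# For now, stopped and fail are simply mapped to down.
DOWN_WORDS = {"down", "stopped", "fail"}


def parse_agent_reply(line: str) -> AgentResult:
    """Parse one agent reply line (assumed format: '<word> [reason]')."""
    word, _, reason = line.strip().partition(" ")
    word = word.lower()
    reason = reason.strip()
    if word in DOWN_WORDS:
        return AgentResult("down", None, word, reason)
    if word.endswith("%") and word[:-1].isdigit():
        return AgentResult(None, int(word[:-1]), word, reason)
    # "disable" is deliberately rejected in this first version.
    raise ValueError(f"unsupported agent reply: {line!r}")
```

Keeping the original keyword alongside the mapped status would let a later
version treat stopped and fail differently without changing the agents.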

> > > That's what I'm commonly observing too. Even right now, there are a lot
> > > of users who use httpchk for services that are not HTTP at all, but they
> > > have a very simple agent responding to checks.
> > > 
> > > So now we have to decide what to do. I think Simon's code already provides
> > > some useful features (assuming we support "down"). It should probably be
> > > extended later to support combined checks.
> > > 
> > > In my opinion, this could be done in three steps :
> > > 
> > >   1) we merge Simon's work with the "option lb-agent-chk" directive which
> > >      *replaces* the health check method with this one ;
> > > 
> > >   2) we implement "agent-port" and "agent-interval" on the server lines to
> > >      automatically enable the agent to be run on another port even when a
> > >      different check is running ;
> > > 
> > > >   3) we implement "http-check agent-hdr <name>" to retrieve the agent
> > > >      string from an HTTP header for HTTP checks ;
> > > 
> > > That way we always support exactly the same syntax but can retrieve the
> > > required information at different places depending on the checks. Does
> > > that sound good to you ?
> > 
> > That sounds entirely reasonable to me.
> 
> Nice!
> 
> Best regards,
> Willy
> 
>
