Hi Simon, thank you for your response, I felt a little bit alone :-)
On Mon, Dec 02, 2013 at 08:56:31PM +0900, Simon Horman wrote:
> On Thu, Nov 28, 2013 at 03:41:15PM +0100, Willy Tarreau wrote:
> > Hi guys,
> > 
> > I'm CCing the persons who've been most involved in the evolutions of the
> > health check system and who might have strong opinions about what to take
> > care of.
> > 
> > The recent inclusion of the agent-check has unveiled how much the current
> > check subsystem is a complex mess full of corner cases. Igor sent me some
> > screen captures of abnormal stats pages with servers marked DRAIN in brown
> > while they were just set to weight 0 in the config, etc... The reason is
> > the ambiguity we have in defining states because we have added more and
> > more exceptions and some combinations are not properly documented. Thus
> > I'm proposing to perform some changes to remain compatible with what we
> > had till now and ensure that both the agent and the CLI work in a coherent
> > and understandable way.
> > 
> > First, a brief summary of the current situation:
> > 
> > - MAINT state has the highest precedence. It can be enforced from the CLI.
> >   A server can be turned from any state to MAINT and the stats page will
> >   report MAINT. Checks are automatically disabled in this state. This
> >   state may be inherited from tracked servers. The stats page reports such
> >   servers in brown whatever their previous check results. Technically,
> >   this state is represented as a flag on the server which is checked
> >   before everything else.
> > 
> > - DRAIN is the state where either the user (via the config or CLI) or the
> >   agent explicitly forces the server weight to zero. It has the second
> >   highest precedence since it can be enforced from the CLI and is
> >   persistent. This state should appear only on servers which are
> >   technically UP, so they can still receive some traffic.
> >   In practice we don't need to "store" the DRAIN state, a server should
> >   be considered in this state when it's UP (or unchecked) and its weight
> >   is zero. It's important to keep a special color for this case
> >   (currently blue) on the stats page. Writing "DRAIN" instead of "UP"
> >   also helps spotting it.
> > 
> > - UNCHECKED is the state where the server is enabled and never performs
> >   any health checks. It's not in DRAIN state either when reported in this
> >   state. That's the gray state on the stats page.
> > 
> > - NOLB cannot be forced from the CLI nor the agent. It's equivalent to a
> >   DRAIN mode except that it is deduced from the result of a health check
> >   ("404") and does not affect the weight. It is maintained until the
> >   check reports a different state, or until the server goes down, where
> >   it automatically clears. It may be inherited from tracked servers. It
> >   is only used in HTTP mode with "http-check disable-on-404" at the
> >   moment.
> > 
> > - UP is the state where the server is consistently seen as OK without
> >   any of the exceptions above. This state is altered by health checks.
> >   The agent might switch away from it, until a new check changes this.
> >   The CLI must provide the ability to do the same. The agent can
> >   currently force the server to be seen up by emitting a weighted
> >   percentage.
> > 
> > - UP/GOINGDOWN is the state where the server was previously seen as OK
> >   but recently failed less than "fail" checks. It's without any of the
> >   exceptions above.
> > 
> > - DOWN is the state where the server is consistently seen as KO without
> >   any of the exceptions above. The agent must be able to temporarily
> >   force the server into this state until the next health check might
> >   change it again. The CLI must provide the ability to do the same.
> > 
> > - DOWN/GOINGUP is the state where the server was previously seen as KO
> >   but recently succeeded less than "rise" checks.
> >   It's without any of the exceptions above.
> > 
> > Right now the MAINT state is propagated from tracked servers, and NOLB
> > is propagated as well, but not DRAIN. Changing a server's weight does
> > not affect the tracking servers' weight, and it definitely must not.
> > 
> > At the moment, only checked servers may be tracked, but since we can now
> > enable/disable a server, it would make sense to allow a server to track
> > unchecked servers as well so that a single "enable" or "disable" applies
> > to the whole list of trackers.
> > 
> > Right now what is propagated across tracked servers is:
> >   - MAINT
> >   - NOLB
> >   - UP/DOWN and DOWN/UP transitions
> > 
> > We should consider that the agent provides exactly the same capabilities
> > as the CLI, because it is used to alter the server's behaviour beyond
> > what the config plans, exactly as the CLI does. This means several
> > things:
> > 
> > - weights are per-server, so a weight change learned from an agent is
> >   not propagated to tracking servers.
> > 
> > - the CLI needs the ability to set a server up or down just like the
> >   agent. This is currently not possible.
> > 
> > - the CLI's "set weight" does not turn the server up while the agent's
> >   weight turns it on; I think we need to align the agent on the CLI
> >   here.
> > 
> > - we'll later have to add a new directive "agent-track", comparable to
> >   "track", to propagate agent changes to tracking servers.
> > 
> > - the CLI always has the final word because from the CLI we can disable
> >   the agent.
> > 
> > The "DRAIN" state is very similar to the NOLB state except that it
> > explicitly forces the weight to zero, causing the loss of the previous
> > weight.
> > 
> > So we should probably change a few things in the agent:
> > 
> > - have the weight announcements not change the server's state, just the
> >   weight, just like the CLI.
> >   This is useful to announce the server's load only without interfering
> >   with checks;
> > 
> > - have DRAIN and NOLB be exactly the same thing. That means that an
> >   agent responds DRAIN when it just wants the server not to receive new
> >   connections regardless of its operating state. This state will be
> >   ignored when the server is already down, and DOWN will follow.
> 
> It's unclear to me what the difference would be between DRAIN/NOLB
> and setting the weight to 0. Is the difference that the weight would
> be retained?

They're almost the same but the difference is elsewhere. At the moment,
NOLB exists to cover the disable-on-404 feature, it's solely set and
removed by health checks:

          404            5xx
     -------> NOLB ---------> DOWN
  OK <---------'                |
         2xx                    | 2xx
     <--------------------------+

The DRAIN state is similar except that it's returned by the agent. Its
purpose is to provide the same service, except that you don't want the
regular health checks to disable it.

All the trouble comes from the coexistence of regular health checks and
the agent, because we need a special state for the agent's DRAIN, so that
it persists whatever the regular health checks do (which is why you have
a flag for this). I would love to be able to merge the two (and I prefer
the term DRAIN to NOLB) but I don't see how it is possible. Or maybe we
should make the disable-on-404 option incompatible with the agent check?
After all, this feature was a preliminary agent embedded in the
application.

> It is also not clear to me at what point DOWN would follow.

My understanding is that UP/DOWN/NOLB are application level and depend on
the service being checked. DOWN may be detected by the regular health
check or enforced by the agent.

> > - support an "up" command to immediately turn the server up and reverse
> >   the effects of "down", allowing it to run without health checks and
> >   just the agent.
> It is not clear to me how up would work in a situation where a server
> had been set to NOLB/DRAIN but was not yet DOWN. This might be
> because I don't understand how the transition from NOLB/DRAIN to DOWN
> would occur.

UP after a DOWN would go to the UP state, in that the NOLB/DRAIN state is
ephemeral and does not replace the server's configured weight.

I think there are two use cases for the agent:

  - inform the load balancer about the server's load to adjust the
    weights, but not interact with the service's state, which is
    monitored using regular checks. It basically replaces the job of the
    admin who would constantly re-adjust weights depending on the
    servers' load.

  - offer a complete health check system for services which are not
    easily checkable. In this case it would simply be used without a
    regular check. This is more a service-level approach and not a
    server-level one.

Both are compatible in my opinion, provided that we work in event mode,
which is actually what you have done. However we need to handle some
conflicts. DRAIN is used for the service but changes the server weight in
a way that is hard to propagate and not easily reversible.

I'm thinking about a very simple script which would return the idle CPU
measure as the weight. That would mean that an unused server would have a
weight of 100 and a fully loaded one a weight of zero. A half-loaded one
would have 50. It's not optimal but it's easy and efficient enough. In
this case it's not possible to use the current DRAIN mode to temporarily
disable the service, because the next measure of idle CPU would
automatically re-enable the service and cause it to leave the DRAIN
state.

Similarly, we cannot propagate the DRAIN state across tracked servers
right now because we don't propagate weights (which are per server and
not per service).

Note, some people might possibly argue that we should propagate weight
changes instead. After all, we already propagate NOLB and MAINT.
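To make the "idle CPU as weight" idea above concrete, here is a minimal
sketch of such an agent in Python. The only protocol assumption is what
is discussed in this thread: the agent answers each connection with a
bare percentage line like "50%". The /proc/stat parsing and all names
here are illustrative, not part of haproxy:

```python
# Sketch of the "very simple script" idea: an agent reporting the
# server's idle CPU as its weight. An idle server reports "100%",
# a saturated one "0%" (effectively draining it).
import socketserver

def idle_percent(prev, curr):
    """Idle CPU percentage between two /proc/stat "cpu" samples.

    Each sample is the list of jiffy counters from the aggregate "cpu"
    line; field 3 (0-based) is the idle counter.
    """
    total = sum(curr) - sum(prev)
    idle = curr[3] - prev[3]
    if total <= 0:
        return 100
    return max(0, min(100, int(100 * idle / total)))

def read_cpu_sample(path="/proc/stat"):
    # First line looks like: "cpu  4705 150 1120 16250 520 29 35 0 0 0"
    with open(path) as f:
        return [int(v) for v in f.readline().split()[1:]]

class AgentHandler(socketserver.StreamRequestHandler):
    prev = None
    def handle(self):
        curr = read_cpu_sample()
        weight = idle_percent(AgentHandler.prev or curr, curr)
        AgentHandler.prev = curr
        self.wfile.write(("%d%%\n" % weight).encode())

if __name__ == "__main__":
    # haproxy side would point an agent check at this port, e.g.
    # (hypothetical config): server s1 1.1.1.1:80 weight 100 check agent-check
    socketserver.TCPServer(("", 9999), AgentHandler).serve_forever()
```

This is exactly the case where the agent must not be able to cancel a
DRAIN: the next idle-CPU sample would immediately report a non-zero
percentage again.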
So it would not seem stupid to propagate weight changes expressed as a
percentage. Eg:

    backend b1
        server s1 1.1.1.1:80 weight 100 check agent-check

    backend b2
        server s1 1.1.1.1:443 weight 50 track b1/s1

So when b1/s1 returns "50%", it would automatically set b2/s1 to 25.
Similarly, when the admin sets b1/s1 to "50" on the CLI, it would change
b2/s1 to 25 as well.

Also, right now if an admin does "set weight b1/s1 30", it does not force
the server up if it was down, it only changes its weight.

The main problem we have in fact is that there are multiple ways of
acting on a server's state:

  - regular health check
  - tracked server's health check
  - agent check
  - config (weight, "disabled" keyword)
  - CLI (which can change the weight and the enabled/disabled state)

The agent tends to fit in the middle of the two sides (checks and
config). That's why I'm thinking it probably has the two purposes. It is
likely that some users will want to report "up 30%" while others will
just want to report the server's load "30%" without affecting the state.

One good example could be:

    backend http
        option httpchk
        server s1 1.1.1.1:80 check agent-check

    backend https
        option httpchk
        server s1 1.1.1.1:443 check ssl agent-track b1/s1

I think this is a realistic configuration since most people won't deploy
multiple agents on a single server, but only one to measure the server's
health. In the config above, s1 would exist both in HTTP and HTTPS, each
with their own health check, and the agent would only return the server's
load as a weight that is propagated to all servers without affecting
their state, just the relative weights.

The problem becomes more obvious when you stop one service only. Eg, you
stop https, but http continues to work and measures more idle CPU which
can be reported as a weight. You clearly don't want this advertised
weight to make the https farm think that the state is up again.
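The percentage-based propagation above can be sketched in a few lines of
Python; the names (Server, apply_agent_percent) are mine for
illustration, not haproxy internals:

```python
# Sketch of percentage-based weight propagation across tracked servers.
# Scaling is always relative to the *configured* weight so that repeated
# agent reports do not compound.

class Server:
    def __init__(self, name, configured_weight, trackers=()):
        self.name = name
        self.configured_weight = configured_weight  # weight from the config
        self.effective_weight = configured_weight
        self.trackers = list(trackers)              # servers tracking this one

def apply_agent_percent(server, percent):
    """Apply an agent-reported percentage to a server and its trackers."""
    for s in [server] + server.trackers:
        s.effective_weight = s.configured_weight * percent // 100

# The example from the text: b1/s1 has weight 100, b2/s1 has weight 50
# and tracks b1/s1. When the agent on b1/s1 reports "50%":
b2_s1 = Server("b2/s1", 50)
b1_s1 = Server("b1/s1", 100, trackers=[b2_s1])
apply_agent_percent(b1_s1, 50)
# b1/s1 becomes 50 and b2/s1 becomes 25, matching the text.
```

The design point is the same as in the paragraph above: the propagated
value is a percentage of each server's own configured weight, never an
absolute weight, since weights are per-server.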
If we do something like this, then DRAIN could be a service health
property and should be essentially the same as NOLB (ie: an UP check can
cancel it). If the agent wants to stop accepting new connections on all
services, then it returns 0% and that's only propagated to other servers
via agent-track and not via track.

So at the bare minimum we need to ensure there is no overlap between the
commands which change the server's health and those which change the
server's performance. I tend to think that users will either use the
agent for up/down status, or for general server performance when used in
a multi-service mode, but will not mix the two because it would otherwise
be a different agent anyway. The "agent-track" feature would provide the
ability to share a server's weight while keeping local per-service
checks.

Thanks for any opinion you could share on these points.

Willy
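P.S.: to keep us on the same page, here is how I read the reporting
precedence from the summary at the top of the thread, as a rough sketch
(Python used for illustration only; none of these names exist in
haproxy):

```python
# Rough sketch of the stats-page state precedence described earlier:
# MAINT first, then DRAIN as "UP (or unchecked) with weight 0", then the
# check-driven states.

def reported_state(maint, checked, up, weight,
                   going_down=False, going_up=False, nolb=False):
    if maint:                       # MAINT has the highest precedence
        return "MAINT"
    if not checked:
        if weight == 0:
            return "DRAIN"          # unchecked with weight 0 is still DRAIN
        return "UNCHECKED"          # the gray state on the stats page
    if up:
        if weight == 0:
            return "DRAIN"          # blue on the stats page
        if nolb:
            return "NOLB"           # set by "http-check disable-on-404"
        return "UP/GOINGDOWN" if going_down else "UP"
    return "DOWN/GOINGUP" if going_up else "DOWN"
```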