Health check hell

Willy Tarreau Thu, 28 Nov 2013 06:42:27 -0800

Hi guys,

I'm CCing the people who've been most involved in the evolution of the
health check system and who might have strong opinions about what to take
care of.


The recent inclusion of the agent-check has unveiled how much the current
check subsystem is a complex mess full of corner cases. Igor sent me some
screen captures of abnormal stats pages with servers marked DRAIN in brown
while they were just set to weight 0 in the config, etc. The reason is
the ambiguity in how we define states: we have added more and more
exceptions, and some combinations are not properly documented. Thus
I'm proposing some changes that remain compatible with what we had until
now and ensure that both the agent and the CLI work in a coherent and
understandable way.

First, a brief summary of the current situation :

  - MAINT state has the highest precedence. It can be enforced from the CLI.
    A server can be turned from any state to MAINT and the stats page will
    report MAINT. Checks are automatically disabled in this state. This
    state may be inherited from tracked servers. The stats page reports such
    servers in brown whatever their previous check results. Technically,
    this state is represented as a flag on the server which is checked before
    everything else.

  - DRAIN is the state where either the user (via the config or CLI) or the
    agent explicitly forces the server weight to zero. It has the second
    highest precedence since it can be enforced from the CLI and is persistent.
    This state should appear only on servers which are technically UP, so they
    can still receive some traffic. In practice we don't need to "store" the
    DRAIN state, a server should be considered in this state when it's UP (or
    unchecked) and its weight is zero. It's important to keep a special color
    for this case (currently blue) on the stats page. Writing "DRAIN" instead
    of "UP" also helps spot it.

  - UNCHECKED is the state where the server is enabled and never performs
    any health checks. A server reported in this state is not in DRAIN state
    either. That's the gray state on the stats page.

  - NOLB cannot be forced from the CLI nor the agent. It's equivalent to a
    DRAIN mode except that it is deduced from the result of a health check
    ("404") and does not affect the weight. It is maintained until the check
    reports a different state, or until the server goes down, where it
    automatically clears. It may be inherited from tracked servers. It is only
    used in HTTP mode with "http-check disable-on-404" at the moment.

  - UP is the state where the server is consistently seen as OK without any of
    the exceptions above. This state is altered by health checks. The agent
    might switch away from it, until a new check changes this. The CLI must
    provide the ability to do the same. The agent can currently force the
    server to be seen up by emitting a weighted percentage.

  - UP/GOINGDOWN is the state where the server was previously seen as OK but
    recently failed fewer than "fall" checks. It's without any of the
    exceptions above.

  - DOWN is the state where the server is consistently seen as KO without any
    of the exceptions above. The agent must be able to temporarily force the
    server into this state until the next health check changes it again. The
    CLI must provide the ability to do the same.

  - DOWN/GOINGUP is the state where the server was previously seen as KO but
    recently succeeded fewer than "rise" checks. It's without any of the
    exceptions above.
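
As a rough illustration of the precedence described in the list above, here
is a minimal Python sketch. The names ServerView and effective_state, and all
field names, are hypothetical and are not HAProxy identifiers. Checking DRAIN
before NOLB is an assumption based on DRAIN being described as having the
second highest precedence.

```python
from dataclasses import dataclass


@dataclass
class ServerView:
    """Hypothetical snapshot of one server; not an HAProxy structure."""
    maint: bool = False        # MAINT flag, checked before everything else
    checked: bool = True       # False: the server never performs health checks
    up: bool = True            # last consolidated check verdict
    nolb: bool = False         # set by a "404" check result
    weight: int = 1            # zero means the server is drained
    recent_failures: int = 0   # failed checks since last seen OK
    recent_successes: int = 0  # successful checks since last seen KO


def effective_state(s: ServerView) -> str:
    """Return the state to report, highest precedence first."""
    if s.maint:
        return "MAINT"
    if not s.checked:
        # DRAIN is not stored: it is deduced from a zero weight.
        return "DRAIN" if s.weight == 0 else "UNCHECKED"
    if s.up:
        if s.weight == 0:
            return "DRAIN"
        if s.nolb:
            return "NOLB"
        return "UP/GOINGDOWN" if s.recent_failures > 0 else "UP"
    # DRAIN and NOLB only apply to servers that are technically up.
    return "DOWN/GOINGUP" if s.recent_successes > 0 else "DOWN"
```

With this reading, a server that is down with weight zero is reported as
DOWN, not DRAIN, since DRAIN should appear only on servers which are
technically UP.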

Right now MAINT state is propagated from tracked servers, NOLB is propagated
as well, but not DRAIN. Changing a server's weight does not affect the tracking
servers' weight, and it definitely must not.

At the moment, only checked servers may be tracked, but since we can now
enable/disable a server, it would make sense to allow a server to track
unchecked servers as well so that a single "enable" or "disable" applies
to the whole list of trackers.

Right now what is propagated across tracked servers is :
  - MAINT
  - NOLB
  - UP->DOWN and DOWN->UP transitions
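
The propagation rules above can be sketched as follows. This is only an
illustration in Python: the class TrackedServer and its methods are
hypothetical, and it assumes the tracking chain contains no loops.

```python
class TrackedServer:
    """Hypothetical model of a server and the servers tracking it."""

    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.maint = False
        self.nolb = False
        self.up = True
        self.trackers = []  # servers configured to track this one

    def track(self, target):
        """Register this server as a tracker of `target`."""
        target.trackers.append(self)

    def _propagate(self, attr, value):
        # Apply locally, then push to every tracker, recursively.
        setattr(self, attr, value)
        for tracker in self.trackers:
            tracker._propagate(attr, value)

    def set_maint(self, on):
        self._propagate("maint", on)

    def set_nolb(self, on):
        self._propagate("nolb", on)

    def set_up(self, up):
        self._propagate("up", up)

    def set_weight(self, weight):
        # Weights are per-server: never pushed to trackers, so a zero
        # weight (DRAIN) on the tracked server stays local.
        self.weight = weight
```

For example, if b tracks a and c tracks b, then a.set_maint(True) puts all
three in MAINT, while a.set_weight(0) leaves the weights of b and c untouched.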

We should consider that the agent provides exactly the same capabilities as the
CLI, because it is used to alter the server's behaviour beyond what the config
plans, exactly as the CLI does. This means several things :

  - weights are per-server, so a weight change learned from an agent is not
    propagated to tracking servers.

  - CLI needs the ability to set a server up or down just like the agent. This
    is currently not possible.

  - CLI's set weight does not turn the server up while the agent's weight
    does. I think we need to align the agent with the CLI here.

  - we'll later have to add a new directive "agent-track", comparable to "track"
    to propagate agent changes to tracking servers.

  - CLI always has the final word because from the CLI we can disable the agent.

The "DRAIN" state is very similar to the NOLB state except that it explicitly
forces the weight to zero, causing the loss of the previous weight.

So probably we should change a few things in the agent :
  - have weight announcements change only the server's weight, not its state,
    just like the CLI. This is useful to announce the server's load only
    without interfering with checks.

  - have DRAIN and NOLB be exactly the same thing. That means that an agent
    responds DRAIN when it just wants the server not to receive new connections
    regardless of its operating state. This state will be ignored when the
    server is already down, and DOWN will follow.

  - support an "up" command to immediately turn the server up and reverse the
    effects of "down", allowing it to run without health checks and just the
    agent.
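
To make the proposed agent behaviour concrete, here is a minimal Python
sketch. Everything in it is an assumption made for illustration: the reply
words ("NN%", "drain", "down", "up"), the names AgentServer and
apply_agent_reply, and the reading of the percentage as a share of the
configured weight. The actual agent string format is exactly what is still
open for discussion.

```python
from dataclasses import dataclass


@dataclass
class AgentServer:
    """Hypothetical server record; not an HAProxy structure."""
    configured_weight: int = 100
    weight: int = 100
    up: bool = True
    nolb: bool = False


def apply_agent_reply(srv: AgentServer, reply: str) -> AgentServer:
    """Apply one agent reply using the semantics proposed above."""
    word = reply.strip().lower()
    if word.endswith("%"):
        # A weight announcement changes only the weight, never the state.
        pct = int(word[:-1])
        srv.weight = srv.configured_weight * pct // 100
    elif word == "drain":
        # Same thing as NOLB: no new connections, weight untouched.
        # Ignored when the server is already down.
        if srv.up:
            srv.nolb = True
    elif word == "down":
        srv.up = False
        srv.nolb = False  # NOLB clears when the server goes down
    elif word == "up":
        # Reverses "down". How a drain gets lifted is not specified in
        # the proposal, so nolb is left alone here.
        srv.up = True
    else:
        raise ValueError(f"unknown agent reply: {reply!r}")
    return srv
```

In this sketch a server that is down and receives "75%" stays down with its
weight updated, which is the alignment with the CLI's "set weight" suggested
above.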

Then these changes will follow for the CLI :

  - the CLI must gain support for setting the NOLB/DRAIN state.

  - the CLI must also support "set server xxx up/down".

On the stats page, we'd report in blue the servers that are either in NOLB
state or that have a weight set to zero, as has been done until now.

What do you think ? I'm willing to perform the changes but I want to be sure
that it will match what users expect, especially for the agent string format.

Thanks,
Willy
