Hi guys,

I'm CCing the people who've been most involved in the evolution of the health check system and who might have strong opinions about what to take care of.

The recent inclusion of the agent-check has revealed how much the current check subsystem is a complex mess full of corner cases. Igor sent me some screen captures of abnormal stats pages with servers marked DRAIN in brown while they were just set to weight 0 in the config, etc. The reason is the ambiguity in how we define states: we have added more and more exceptions, and some combinations are not properly documented.

I'm therefore proposing some changes that remain compatible with what we have had until now and ensure that both the agent and the CLI work in a coherent and understandable way.

First, a brief summary of the current situation:

  - MAINT state has the highest precedence. It can be enforced from the CLI. A server can be turned from any state to MAINT and the stats page will report MAINT. Checks are automatically disabled in this state. This state may be inherited from tracked servers. The stats page reports such servers in brown whatever their previous check results. Technically, this state is represented as a flag on the server which is checked before everything else.

  - DRAIN is the state where either the user (via the config or CLI) or the agent explicitly forces the server weight to zero. It has the second highest precedence since it can be enforced from the CLI and is persistent. This state should appear only on servers which are technically UP, so they can still receive some traffic. In practice we don't need to "store" the DRAIN state; a server should be considered in this state when it's UP (or unchecked) and its weight is zero. It's important to keep a special color for this case (currently blue) on the stats page. Writing "DRAIN" instead of "UP" also helps in spotting it.

  - UNCHECKED is the state where the server is enabled and never performs any health checks. A server reported in this state is not in DRAIN state either. That's the gray state on the stats page.

  - NOLB cannot be forced from either the CLI or the agent.
    It's equivalent to DRAIN mode except that it is deduced from the result of a health check ("404") and does not affect the weight. It is maintained until the check reports a different state, or until the server goes down, at which point it clears automatically. It may be inherited from tracked servers. At the moment it is only used in HTTP mode with "http-check disable-on-404".

  - UP is the state where the server is consistently seen as OK without any of the exceptions above. This state is altered by health checks. The agent may switch the server away from it until a new check changes this. The CLI must provide the ability to do the same. The agent can currently force the server to be seen as up by emitting a weight percentage.

  - UP/GOINGDOWN is the state where the server was previously seen as OK but has recently failed fewer than "fall" checks. None of the exceptions above applies.

  - DOWN is the state where the server is consistently seen as KO without any of the exceptions above. The agent must be able to temporarily force the server into this state until the next health check changes it again. The CLI must provide the ability to do the same.

  - DOWN/GOINGUP is the state where the server was previously seen as KO but has recently succeeded fewer than "rise" checks. None of the exceptions above applies.

Right now the MAINT state is propagated from tracked servers, and NOLB is propagated as well, but not DRAIN. Changing a server's weight does not affect the tracking servers' weight, and it definitely must not. At the moment, only checked servers may be tracked, but since we can now enable/disable a server, it would make sense to allow a server to track unchecked servers as well, so that a single "enable" or "disable" applies to the whole list of trackers.

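To make the precedence between the states summarized above easier to follow, here is a minimal Python sketch of how a displayed state could be derived from a handful of per-server flags. It is not HAProxy code: the `Server` fields and the `derive_state()` helper are hypothetical names, and the exact ordering (DOWN tested before DRAIN, DRAIN before NOLB) is only one reading of the description, based on the remark that DRAIN should appear only on servers which are technically UP.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # All field names are illustrative assumptions, not HAProxy structures.
    maint: bool = False       # MAINT flag, forced from the CLI or inherited
    checked: bool = True      # False when the server performs no health checks
    up: bool = True           # consolidated check result (unused if unchecked)
    transition: bool = False  # recent results disagree, but below rise/fall
    nolb: bool = False        # set by a "404" check result (disable-on-404)
    weight: int = 1           # effective weight; zero means drain


def derive_state(s: Server) -> str:
    # MAINT is a flag tested before everything else.
    if s.maint:
        return "MAINT"
    # A checked server seen as KO is reported down whatever its weight.
    if s.checked and not s.up:
        return "DOWN/GOINGUP" if s.transition else "DOWN"
    # From here on the server is either UP or unchecked.
    if s.weight == 0:
        return "DRAIN"  # not stored anywhere, deduced from the weight
    if s.checked and s.nolb:
        return "NOLB"
    if not s.checked:
        return "UNCHECKED"
    return "UP/GOINGDOWN" if s.transition else "UP"
```

With this ordering, a server set to weight 0 in the config and seen as UP (or unchecked) comes out as DRAIN, while the same server in maintenance still comes out as MAINT.
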
Right now, what is propagated across tracked servers is:

  - MAINT
  - NOLB
  - UP/DOWN and DOWN/UP transitions

We should consider that the agent provides exactly the same capabilities as the CLI, because it is used to alter the server's behaviour beyond what the config plans, exactly as the CLI does. This means several things:

  - weights are per-server, so a weight change learned from an agent is not propagated to tracking servers.

  - the CLI needs the ability to set a server up or down just like the agent. This is currently not possible.

  - the CLI's "set weight" does not turn the server up, while the agent's weight does. I think we need to align the agent with the CLI here.

  - we'll later have to add a new directive "agent-track", comparable to "track", to propagate agent changes to tracking servers.

  - the CLI always has the final word, because from the CLI we can disable the agent.

The "DRAIN" state is very similar to the NOLB state except that it explicitly forces the weight to zero, causing the loss of the previous weight. So we should probably change a few things in the agent:

  - have weight announcements not change the server's state, just the weight, just like the CLI. This is useful to announce the server's load only, without interfering with checks;

  - have DRAIN and NOLB be exactly the same thing. That means that an agent responds DRAIN when it just wants the server not to receive new connections regardless of its operating state. This state will be ignored when the server is already down, and DOWN will follow.

  - support an "up" command to immediately turn the server up and reverse the effects of "down", allowing it to run without health checks and with just the agent.

Then these changes will follow for the CLI:

  - the CLI must gain support for setting the NOLB/DRAIN state.

  - the CLI must also support "set server xxx up/down".

On the stats page we'd report in blue the servers that are either in NOLB state or that have a weight set to zero, as has been done until now.

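As an illustration of the proposed alignment between the agent and the CLI, the following Python sketch routes both through a single hypothetical `apply_command()` function. The command words ("up", "down", "drain", "NN%") and all names are assumptions made for the example, since the agent string format is precisely what is still open here. The sketch also assumes that a weight percentage is relative to the configured weight, and it takes "ignored when the server is already down" literally by not recording the drain request at all. How a server leaves the drain state is not specified above, so the sketch has no command for it.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical model of the proposed semantics, not HAProxy code.
    up: bool = True         # last state set by a check or by "up"/"down"
    drain: bool = False     # merged DRAIN/NOLB flag, weight is preserved
    base_weight: int = 100  # weight from the config
    weight: int = 100       # effective weight


def apply_command(srv: Server, word: str) -> None:
    """Apply one word coming from either the agent or the CLI."""
    word = word.strip().lower()
    if word == "up":
        srv.up = True            # reverses the effect of "down"
    elif word == "down":
        srv.up = False           # holds until a later check or "up"
    elif word == "drain":
        if srv.up:               # ignored when the server is already down
            srv.drain = True
    elif word.endswith("%") and word[:-1].isdigit():
        # A weight announcement only changes the weight, never the state.
        srv.weight = srv.base_weight * int(word[:-1]) // 100
    else:
        raise ValueError(f"unsupported command: {word!r}")


def report(srv: Server) -> str:
    """State shown on the stats page; DRAIN is the blue case."""
    if not srv.up:
        return "DOWN"
    if srv.drain or srv.weight == 0:
        return "DRAIN"
    return "UP"
```

Because the weight and the drain flag are stored separately, a drained server keeps its previous weight, which is the main difference from forcing the weight to zero.
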
What do you think? I'm willing to perform the changes, but I want to be sure that they will match what users expect, especially for the agent string format.

Thanks,
Willy