Re: Health check hell

Malcolm Turnbull Wed, 04 Dec 2013 13:21:24 -0800

Forgot to reply to all:

Willy,


This looks good to me and makes sense.
Long term it will be more flexible this way.









On 4 December 2013 18:17, Willy Tarreau <w...@1wt.eu> wrote:
> Hi Malcolm,
>
> On Wed, Dec 04, 2013 at 03:05:41PM +0000, Malcolm Turnbull wrote:
>> Hi Willy,
>>
>> Sorry for the lack of response from the Loadbalancer.org end, I must
>> confess we were getting a bit confused by the descriptions :-).
>
> I'm not surprised! I got even more confused when trying to debug some
> of the issues Igor reported and not understanding what would act on
> what, what would be propagated from tracked servers, etc... Anyway,
> writing the design limitations here and explaining them helps us
> get rid of them.
>
>> The only thing in my mind to be aware of is the design decision for the
>> agent to report DOWN or DRAIN on every agent request until the agent
>> starts responding with x% again. That was because if you send an UP
>> response from the agent, how does the agent know that HAProxy has read
>> that value and acted on it? It would need to know when it was safe to
>> start responding with x% again.
>
> OK I get your point. My point was to emit two things at once.
> Eg: "UP 10%".
>
> We could have the agent specification state that the response format
> may include optional state words, optionally followed by a weight.
> That way we can have agents which return state only, weight only or
> both.
>
>> Our primary requirement at Loadbalancer.org is for the first scenario
>> i.e. dynamic weight adjustment using standard health checks:
>>
>>   - inform the load balancer about the server's load to adjust the
>>     weights, but not interact with the service's state which is
>>     monitored using regular checks. It basically replaces the job
>>     of the admin who would constantly re-adjust weights depending
>>     on the servers' load.
>
> I agree that this should be by far the most common use especially in
> combination with the service check. That's the reason why I'm embarrassed
> by the fact that we put the server UP when returning a percentage because
> it means the agent returning the load has to be aware of the service state
> which is not logical.
>
>> The following usage case makes sense, but isn't really a priority for us:
>>
>>   - offer a complete health check system to services which are not
>>     easily checkable. In this case they would simply be used without
>>     a regular check. This is more a service-level approach and not
>>     a server-level one.
>
> It's not my priority either though I know some people will want it when
> they already have to use an agent and need to deploy a second script to
> check the health of a specific service : they won't find it convenient
> to run two scripts on different ports, one for the state and one for the
> load.
>
>> The third logical function for us was:
>>
>> For a Windows administrator to have a simple GUI DRAIN/HALT button in
>> the agent, to enable quick local maintenance on the Windows backend
>> server without having to log into the load balancer in order to set
>> maintenance mode.
>
> Hehe, just like the 404 feature in HTTP :-)
>
>> But again this is not really a priority with us as you say it clashes
>> with the CLI DRAIN logic....
>
> It does not exactly clash, it depends how we define it. I discovered there
> are 3 dimensions which are managed by a single agent while we initially
> thought there were only two. The agent can :
>
>     - declare a service's state (up or down)
>     - declare an administrative state (drain/ready)
>     - declare a system load (weight)
>
> But at the moment with the language we defined, each action changes two
> of them at once, which is a big problem.
>
> And depending on what system the agent will be deployed on, not all these
> features will be used together. I expect that admin state and load will be
> the most common ones for an agent. Your enumeration tends to support this.
>
> So let's try with something like this for the agent syntax :
>
>   [keywords]* [weight]
>
>   Where [keywords] are optional and made of :
>
>      "up" : report that the service is UP.
>      "down", "stopped", "fail" : report the service down with these causes
>      "drain" : don't change the state, nor the weight, just set DRAIN mode.
>      "maint" : don't change the state, nor the weight, just set MAINT mode
>      "ready" : don't change the state, nor the weight, just leave MAINT and
>                DRAIN modes.
>
>   And [weight] is optional and in the form "xxx%" to report the desired
>   weight for this server relative to the configured one in the config.
>
> Thus the following examples might illustrate it better :
>
>    "up"        : declare the server up, don't change the configured weight
>    "up 50%"    : declare the server up, set weight to 50%
>    "50%"       : don't touch the server state, just set the weight to 50%
>    "drain"     : don't touch the state, nor weight, just switch to drain mode.
>    "maint"     : force maintenance mode.
>    "drain 20%" : drain mode, adjust weight to 20% (not used in this mode but
>                  will avoid complex logic in agent scripts)
>    "ready 30%" : leave maint/drain modes, start at 30% weight.
>    "up ready 40%" : the agent does the 3 things at once and says the
>                     service is OK.
>    "stopped drain 10%" : the agent does the 3 things at once and indicates
>                          that the server is now down after drain mode.
>
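
To make the proposed "[keywords]* [weight]" syntax concrete, here is a
minimal Python sketch of a parser for it. It is only an illustration of
the grammar above, not HAProxy code: the names AgentReply and
parse_agent_reply are hypothetical, and two simplifications are assumed
(words are matched case-insensitively so that "UP 10%" also parses, and
the weight is accepted in any position rather than only last).

```python
from dataclasses import dataclass
from typing import Optional

# Keyword groups taken from the proposal. The Python names below are
# illustrative assumptions, not HAProxy identifiers.
SERVICE_UP = {"up"}
SERVICE_DOWN = {"down", "stopped", "fail"}
ADMIN = {"drain", "maint", "ready"}


@dataclass
class AgentReply:
    service: Optional[str] = None  # "up", "down", "stopped", "fail" or None
    admin: Optional[str] = None    # "drain", "maint", "ready" or None
    weight: Optional[int] = None   # percentage of the configured weight


def parse_agent_reply(line: str) -> AgentReply:
    """Parse '[keywords]* [weight]'; a field left as None means 'do not touch'."""
    reply = AgentReply()
    for word in line.strip().lower().split():
        if word in SERVICE_UP or word in SERVICE_DOWN:
            reply.service = word
        elif word in ADMIN:
            reply.admin = word
        elif word.endswith("%") and word[:-1].isdigit():
            reply.weight = int(word[:-1])
        else:
            raise ValueError(f"unknown agent word: {word!r}")
    return reply
```

Keeping the three dimensions in separate fields is the point of the
sketch: "50%" leaves service and admin unset, while "stopped drain 10%"
sets all three in a single reply.
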
> I remember we initially refrained from allowing the "maint" mode from the
> agent in its first version because it was planned as a regular check and
> we didn't want it to be stuck in this mode. But now that the agent runs
> on its own, it makes much more sense since it will continue to be checked.
>
> With this, we can also consider that if a regular check is configured on the
> server, then the state changes are ignored from the agent. This greatly
> simplifies deployments relying on a single agent for multiple services
> even if this agent was initially deployed for a specific service.
>
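
The rule in the paragraph above (agent state changes are ignored when a
regular check is configured) could be sketched as follows. This is a
hypothetical model in Python, assuming the agent reply has already been
split into its three optional parts. Server and apply_agent_reply are
made-up names and do not reflect HAProxy's internal structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    # Hypothetical model of a server entry, not HAProxy's internal structure.
    has_regular_check: bool
    service_up: bool = True
    admin: str = "ready"    # "ready", "drain" or "maint"
    weight_pct: int = 100   # relative to the configured weight


def apply_agent_reply(srv: Server, service: Optional[str],
                      admin: Optional[str], weight: Optional[int]) -> None:
    """Apply an already-parsed agent reply; None means 'leave unchanged'."""
    # The agent's service state only counts when no regular check owns it.
    if service is not None and not srv.has_regular_check:
        srv.service_up = (service == "up")
    # Admin state and weight are always taken from the agent.
    if admin is not None:
        srv.admin = admin
    if weight is not None:
        srv.weight_pct = weight
```

With this split, one agent can serve several services on the same box:
each server entry keeps its own regular check for the service state and
only takes the admin state and weight from the shared agent.
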
> We would have to improve the CLI and the stats interface to match that. We'd
> change the "soft stop" in the stats interface to act on the DRAIN mode instead
> of the weight. It would provide the same effect as today but in a more
> consistent way.
>
> Proceeding like this, I can easily imagine that most agents will simply
> read a small file containing the admin state (maint/drain/ready) and
> that others will only report the idle CPU measure.
>
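
On the agent side, the two simple cases described above (a small file
holding the admin state, and an idle CPU measure) could be combined
roughly like this. This is a sketch under assumptions: the function name
build_agent_reply is hypothetical, the idle CPU percentage is used
directly as the weight and clamped to 1..100%, and the reply is
terminated by a newline. How the file is read and how idle CPU is
measured is left to the caller.

```python
from typing import Optional

ADMIN_STATES = {"maint", "drain", "ready"}


def build_agent_reply(admin_file_text: Optional[str],
                      idle_cpu_pct: Optional[float]) -> str:
    """Build one reply line from an admin-state file and/or an idle CPU figure."""
    words = []
    if admin_file_text is not None:
        state = admin_file_text.strip().lower()
        # Unknown file contents are ignored rather than sent to the balancer.
        if state in ADMIN_STATES:
            words.append(state)
    if idle_cpu_pct is not None:
        # Assumption: idle CPU maps directly to the weight, clamped to 1..100%.
        pct = max(1, min(100, round(idle_cpu_pct)))
        words.append(f"{pct}%")
    return " ".join(words) + "\n"
```

An agent that only manages the admin state would pass None for the idle
CPU figure, and a load-only agent would pass None for the file contents.
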
> What do you think ?
>
> Thanks,
> Willy
>



-- 
Regards,

Malcolm Turnbull.

Loadbalancer.org Ltd.
Phone: +44 (0)870 443 8779
http://www.loadbalancer.org/
