One of the things we discussed when we added `CheckInfo` and
`CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy`
field (inside `TaskStatus`) consistent with the new `Check` format.

IIRC, some of the changes we wanted to do were

   - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto
   - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
   should be named differently in `HealthCheckInfo` (e.g., `Http`)
   - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
   `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`

Right now, the proposal seems to only address the last point without
addressing the first two, which feels weird to me. I would prefer to see
them addressed in one shot.

Additionally, the proposed `HealthCheckStatusInfo` proto looks completely
different from `CheckStatusInfo`. Is that intentional? I hope we are not
thinking of deprecating it again when we come around to fix `HealthCheck`
proto to be consistent with `CheckInfo` ?

Thanks,

On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <g...@mesosphere.io> wrote:

> Hi all,
> Some users have recently reported issues with our current implementation
> of health checks. See this ticket
> <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
> the issue.
>
> To summarize: we currently use a single 'optional bool healthy' field
> within the 'TaskStatus' message to indicate the result of a health check.
> This allows us to expose 3 health states to users:
> 1) 'healthy' field is unset = no health check specified, or health check
> failed but grace period has not yet elapsed, or health check has not yet
> been attempted
> 2) 'healthy' field is set to 'false' = a health check is specified and it
> returned 'false'
> 3) 'healthy' field is set to 'true' = a health check is specified and it
> returned 'true'
>
> The issue is that some users need to distinguish between the three
> scenarios in #1: no health check is specified, OR the task is not yet
> healthy but we are in the grace period. An example use case would be a load
> balancer which needs to wait for a healthy status to route traffic, but
> which immediately routes traffic to tasks which have no health check
> defined.
>
> This issue was recognized during the design of Mesos generalized checks;
> for those checks, we use the presence of the 'check_status' field to
> indicate whether or not a check is defined for the task. While consumers
> could make use of generalized checks as a workaround, this does not allow
> them to both detect the presence of a check AND achieve the task-killing
> behavior that health checks provide.
>
> In order to address this, I would like to propose the following new
> message, and an addition to the 'TaskStatus' message:
>
> message HealthCheckStatusInfo {
>   enum Status {
>     UNKNOWN = 0;
>     HEALTHY = 1;
>     UNHEALTHY = 2;
>   }
>
>   required Status status = 0;
> }
>
> message TaskStatus {
>   . . .
>
>   optional HealthCheckStatusInfo health_check_status = 17;
>
>   . . .
> }
>
> The semantics of these fields would be as follows:
>
> 'health_status' field:
> - If set, a health check has been set
> - If unset, a health check has not been set
>
> 'health_status.status' field:
> - UNKNOWN: The task has not become healthy but is still within its grace
> period (this state is also used if an internal error prevents us from
> running the health check successfully)
> - HEALTHY: The health check indicates the task is healthy
> - UNHEALTHY: The health check indicates the task is not healthy
>
> This change would also involve deprecating the existing 'healthy' field.
> In accordance with our deprecation policy, I believe we could not remove
> the deprecated field until we have a new major release (2.x).
>
> I'd love to hear feedback on this proposal, thanks in advance! I'll also
> add this as an agenda item to our upcoming API working group meeting on
> Tuesday, Oct. 16 at 11am PST.
>
> Cheers,
> Greg
>

Reply via email to