One of the things we discussed when we added `CheckInfo` and `CheckStatusInfo` was to make the older `HealthCheck` and `bool healthy` field (inside `TaskStatus`) consistent with the new `Check` format.
IIRC, some of the changes we wanted to do were - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` proto - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`) should be named differently in `HealthCheckInfo` (e.g., `Http`) - Deprecate `bool healthy` in TaskStatusInfo and introduce a new `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo` Right now, the proposal seems to only address the last point without addressing the first two, which feels weird to me. I would prefer to see them addressed in one shot. Additionally, the proposed `HealthCheckStatusInfo` proto looks completely different from `CheckStatusInfo`. Is that intentional? I hope we are not thinking of deprecating it again when we come around to fix `HealthCheck` proto to be consistent with `CheckInfo` ? Thanks, On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <g...@mesosphere.io> wrote: > Hi all, > Some users have recently reported issues with our current implementation > of health checks. See this ticket > <https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to > the issue. > > To summarize: we currently use a single 'optional bool healthy' field > within the 'TaskStatus' message to indicate the result of a health check. > This allows us to expose 3 health states to users: > 1) 'healthy' field is unset = no health check specified, or health check > failed but grace period has not yet elapsed, or health check has not yet > been attempted > 2) 'healthy' field is set to 'false' = a health check is specified and it > returned 'false' > 3) 'healthy' field is set to 'true' = a health check is specified and it > returned 'true' > > The issue is that some users need to distinguish between the three > scenarios in #1: no health check is specified, OR the task is not yet > healthy but we are in the grace period. An example use case would be a load > balancer which needs to wait for a healthy status to route traffic, but > which immediately routes traffic to tasks which have no health check > defined. > > This issue was recognized during the design of Mesos generalized checks; > for those checks, we use the presence of the 'check_status' field to > indicate whether or not a check is defined for the task. While consumers > could make use of generalized checks as a workaround, this does not allow > them to both detect the presence of a check AND achieve the task-killing > behavior that health checks provide. > > In order to address this, I would like to propose the following new > message, and an addition to the 'TaskStatus' message: > > message HealthCheckStatusInfo { > enum Status { > UNKNOWN = 0; > HEALTHY = 1; > UNHEALTHY = 2; > } > > required Status status = 0; > } > > message TaskStatus { > . . . > > optional HealthCheckStatusInfo health_check_status = 17; > > . . . > } > > The semantics of these fields would be as follows: > > 'health_status' field: > - If set, a health check has been set > - If unset, a health check has not been set > > 'health_status.status' field: > - UNKNOWN: The task has not become healthy but is still within its grace > period (this state is also used if an internal error prevents us from > running the health check successfully) > - HEALTHY: The health check indicates the task is healthy > - UNHEALTHY: The health check indicates the task is not healthy > > This change would also involve deprecating the existing 'healthy' field. > In accordance with our deprecation policy, I believe we could not remove > the deprecated field until we have a new major release (2.x). > > I'd love to hear feedback on this proposal, thanks in advance! I'll also > add this as an agenda item to our upcoming API working group meeting on > Tuesday, Oct. 16 at 11am PST. > > Cheers, > Greg >