Discussed with AlexR, and we're going to do a bit of due diligence to ensure that the changes we make now will fit with the future we envision for health checks. I'll publish a short document on this thread once we've done so.
Thanks! Greg On Thu, Oct 18, 2018 at 1:40 PM Alex Rukletsov <a...@mesosphere.com> wrote: > Why do we need to resolve this note now? It is obvious that health > interpretation must be part of the API. I'm not sure I understand what > concerns you have, Vinod. > > On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone <vinodk...@apache.org> wrote: > > > I understand and am in agreement that `HealthCheckStatusInfo` will have > > more information than `CheckStatusInfo`. > > > > I would like us to put a little more thought into how that would look > like > > to be doubly sure that what we are introducing today will be evolvable > into > > that envisioned future. We have to live with API changes for a long time, > > so I would like to see more rigor here (e.g., has the note on top of the > > `HealthCheckStatusInfo` in the doc > > < > > > https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v > > > > > has > > been discussed/resolved?) to avoid costly changes/deprecations. > > > > On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <a...@mesosphere.com> > > wrote: > > > > > Thanks for the thoughts, Vinod! Answers inlined. > > > > > > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vinodk...@apache.org> > wrote: > > > > > > > One of the things we discussed when we added `CheckInfo` and > > > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool > > healthy` > > > > field (inside `TaskStatus`) consistent with the new `Check` format. > > > > > > > Correct. > > > > > > > > > > > IIRC, some of the changes we wanted to do were > > > > > > > > - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo` > > proto > > > > > > > Correct. > > > > > > > - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`) > > > > > > should be named differently in `HealthCheckInfo` (e.g., `Http`) > > > > > > > Likely, yes. > > > > > > > - Deprecate `bool healthy` in TaskStatusInfo and introduce a new > > > > `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo` > > > > > > > Correct. > > > > > > > > > > > Right now, the proposal seems to only address the last point without > > > > addressing the first two, which feels weird to me. I would prefer to > > see > > > > them addressed in one shot. > > > > > > > Can you please explain why? Is there any problem you foresee if we do > it > > > step by step? Introducing `HealthCheckStatusInfo` now solves an > important > > > problem and does not seem to introduce new issues. > > > > > > > > > > > Additionally, the proposed `HealthCheckStatusInfo` proto looks > > completely > > > > different from `CheckStatusInfo`. Is that intentional? I hope we are > > not > > > > thinking of deprecating it again when we come around to fix > > `HealthCheck` > > > > proto to be consistent with `CheckInfo` ? > > > > > > > How do you think it should look like? Why will we deprecate it? > > > > > > Health checks are different from checks in the way the result of a > check > > is > > > interpreted on the agent. In other words health check is an extra step > on > > > top of a check. We might include `CheckStatusInfo` or its contents into > > > `HealthCheckStatusInfo`, but... should we think about this now? It is > > nice > > > to have lower level info from the check in the heath status update, but > > it > > > also means more data to transfer. But interpretation—health—we > definitely > > > need. > > > > > > Greg, I'm +1 on your proposal. > > > > > > > > > > > Thanks, > > > > > > > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <g...@mesosphere.io> > wrote: > > > > > > > > > Hi all, > > > > > Some users have recently reported issues with our current > > > implementation > > > > > of health checks. See this ticket > > > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an > > introduction > > > > to > > > > > the issue. > > > > > > > > > > To summarize: we currently use a single 'optional bool healthy' > field > > > > > within the 'TaskStatus' message to indicate the result of a health > > > check. > > > > > This allows us to expose 3 health states to users: > > > > > 1) 'healthy' field is unset = no health check specified, or health > > > check > > > > > failed but grace period has not yet elapsed, or health check has > not > > > yet > > > > > been attempted > > > > > 2) 'healthy' field is set to 'false' = a health check is specified > > and > > > it > > > > > returned 'false' > > > > > 3) 'healthy' field is set to 'true' = a health check is specified > and > > > it > > > > > returned 'true' > > > > > > > > > > The issue is that some users need to distinguish between the three > > > > > scenarios in #1: no health check is specified, OR the task is not > yet > > > > > healthy but we are in the grace period. An example use case would > be > > a > > > > load > > > > > balancer which needs to wait for a healthy status to route traffic, > > but > > > > > which immediately routes traffic to tasks which have no health > check > > > > > defined. > > > > > > > > > > This issue was recognized during the design of Mesos generalized > > > checks; > > > > > for those checks, we use the presence of the 'check_status' field > to > > > > > indicate whether or not a check is defined for the task. While > > > consumers > > > > > could make use of generalized checks as a workaround, this does not > > > allow > > > > > them to both detect the presence of a check AND achieve the > > > task-killing > > > > > behavior that health checks provide. > > > > > > > > > > In order to address this, I would like to propose the following new > > > > > message, and an addition to the 'TaskStatus' message: > > > > > > > > > > message HealthCheckStatusInfo { > > > > > enum Status { > > > > > UNKNOWN = 0; > > > > > HEALTHY = 1; > > > > > UNHEALTHY = 2; > > > > > } > > > > > > > > > > required Status status = 0; > > > > > } > > > > > > > > > > message TaskStatus { > > > > > . . . > > > > > > > > > > optional HealthCheckStatusInfo health_check_status = 17; > > > > > > > > > > . . . > > > > > } > > > > > > > > > > The semantics of these fields would be as follows: > > > > > > > > > > 'health_status' field: > > > > > - If set, a health check has been set > > > > > - If unset, a health check has not been set > > > > > > > > > > 'health_status.status' field: > > > > > - UNKNOWN: The task has not become healthy but is still within its > > > grace > > > > > period (this state is also used if an internal error prevents us > from > > > > > running the health check successfully) > > > > > - HEALTHY: The health check indicates the task is healthy > > > > > - UNHEALTHY: The health check indicates the task is not healthy > > > > > > > > > > This change would also involve deprecating the existing 'healthy' > > > field. > > > > > In accordance with our deprecation policy, I believe we could not > > > remove > > > > > the deprecated field until we have a new major release (2.x). > > > > > > > > > > I'd love to hear feedback on this proposal, thanks in advance! I'll > > > also > > > > > add this as an agenda item to our upcoming API working group > meeting > > on > > > > > Tuesday, Oct. 16 at 11am PST. > > > > > > > > > > Cheers, > > > > > Greg > > > > > > > > > > > > > > >