Discussed with AlexR, and we're going to do a bit of due diligence to
ensure that the changes we make now will fit with the future we envision
for health checks. I'll publish a short document on this thread once we've
done so.

Thanks!
Greg

On Thu, Oct 18, 2018 at 1:40 PM Alex Rukletsov <a...@mesosphere.com> wrote:

> Why do we need to resolve this note now? It is obvious that health
> interpretation must be part of the API. I'm not sure I understand what
> concerns you have, Vinod.
>
> On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone <vinodk...@apache.org> wrote:
>
> > I understand and am in agreement that `HealthCheckStatusInfo` will have
> > more information than `CheckStatusInfo`.
> >
> > I would like us to put a little more thought into how that would look
> like
> > to be doubly sure that what we are introducing today will be evolvable
> into
> > that envisioned future. We have to live with API changes for a long time,
> > so I would like to see more rigor here (e.g., has the note on top of the
> > `HealthCheckStatusInfo` in the doc
> > <
> >
> https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v
> > >
> > has
> > been discussed/resolved?) to avoid costly changes/deprecations.
> >
> > On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <a...@mesosphere.com>
> > wrote:
> >
> > > Thanks for the thoughts, Vinod! Answers inlined.
> > >
> > > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vinodk...@apache.org>
> wrote:
> > >
> > > > One of the things we discussed when we added `CheckInfo` and
> > > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool
> > healthy`
> > > > field (inside `TaskStatus`) consistent with the new `Check` format.
> > > >
> > > Correct.
> > >
> > > >
> > > > IIRC, some of the changes we wanted to do were
> > > >
> > > >    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo`
> > proto
> > > >
> > > Correct.
> > >
> > > >    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
> > >
> > >    should be named differently in `HealthCheckInfo` (e.g., `Http`)
> > > >
> > > Likely, yes.
> > >
> > > >    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> > > >    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> > > >
> > > Correct.
> > >
> > > >
> > > > Right now, the proposal seems to only address the last point without
> > > > addressing the first two, which feels weird to me. I would prefer to
> > see
> > > > them addressed in one shot.
> > > >
> > > Can you please explain why? Is there any problem you foresee if we do
> it
> > > step by step? Introducing `HealthCheckStatusInfo` now solves an
> important
> > > problem and does not seem to introduce new issues.
> > >
> > > >
> > > > Additionally, the proposed `HealthCheckStatusInfo` proto looks
> > completely
> > > > different from `CheckStatusInfo`. Is that intentional? I hope we are
> > not
> > > > thinking of deprecating it again when we come around to fix
> > `HealthCheck`
> > > > proto to be consistent with `CheckInfo` ?
> > > >
> > > How do you think it should look like? Why will we deprecate it?
> > >
> > > Health checks are different from checks in the way the result of a
> check
> > is
> > > interpreted on the agent. In other words health check is an extra step
> on
> > > top of a check. We might include `CheckStatusInfo` or its contents into
> > > `HealthCheckStatusInfo`, but... should we think about this now? It is
> > nice
> > > to have lower level info from the check in the heath status update, but
> > it
> > > also means more data to transfer. But interpretation—health—we
> definitely
> > > need.
> > >
> > > Greg, I'm +1 on your proposal.
> > >
> > > >
> > > > Thanks,
> > > >
> > > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <g...@mesosphere.io>
> wrote:
> > > >
> > > > > Hi all,
> > > > > Some users have recently reported issues with our current
> > > implementation
> > > > > of health checks. See this ticket
> > > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an
> > introduction
> > > > to
> > > > > the issue.
> > > > >
> > > > > To summarize: we currently use a single 'optional bool healthy'
> field
> > > > > within the 'TaskStatus' message to indicate the result of a health
> > > check.
> > > > > This allows us to expose 3 health states to users:
> > > > > 1) 'healthy' field is unset = no health check specified, or health
> > > check
> > > > > failed but grace period has not yet elapsed, or health check has
> not
> > > yet
> > > > > been attempted
> > > > > 2) 'healthy' field is set to 'false' = a health check is specified
> > and
> > > it
> > > > > returned 'false'
> > > > > 3) 'healthy' field is set to 'true' = a health check is specified
> and
> > > it
> > > > > returned 'true'
> > > > >
> > > > > The issue is that some users need to distinguish between the three
> > > > > scenarios in #1: no health check is specified, OR the task is not
> yet
> > > > > healthy but we are in the grace period. An example use case would
> be
> > a
> > > > load
> > > > > balancer which needs to wait for a healthy status to route traffic,
> > but
> > > > > which immediately routes traffic to tasks which have no health
> check
> > > > > defined.
> > > > >
> > > > > This issue was recognized during the design of Mesos generalized
> > > checks;
> > > > > for those checks, we use the presence of the 'check_status' field
> to
> > > > > indicate whether or not a check is defined for the task. While
> > > consumers
> > > > > could make use of generalized checks as a workaround, this does not
> > > allow
> > > > > them to both detect the presence of a check AND achieve the
> > > task-killing
> > > > > behavior that health checks provide.
> > > > >
> > > > > In order to address this, I would like to propose the following new
> > > > > message, and an addition to the 'TaskStatus' message:
> > > > >
> > > > > message HealthCheckStatusInfo {
> > > > >   enum Status {
> > > > >     UNKNOWN = 0;
> > > > >     HEALTHY = 1;
> > > > >     UNHEALTHY = 2;
> > > > >   }
> > > > >
> > > > >   required Status status = 0;
> > > > > }
> > > > >
> > > > > message TaskStatus {
> > > > >   . . .
> > > > >
> > > > >   optional HealthCheckStatusInfo health_check_status = 17;
> > > > >
> > > > >   . . .
> > > > > }
> > > > >
> > > > > The semantics of these fields would be as follows:
> > > > >
> > > > > 'health_status' field:
> > > > > - If set, a health check has been set
> > > > > - If unset, a health check has not been set
> > > > >
> > > > > 'health_status.status' field:
> > > > > - UNKNOWN: The task has not become healthy but is still within its
> > > grace
> > > > > period (this state is also used if an internal error prevents us
> from
> > > > > running the health check successfully)
> > > > > - HEALTHY: The health check indicates the task is healthy
> > > > > - UNHEALTHY: The health check indicates the task is not healthy
> > > > >
> > > > > This change would also involve deprecating the existing 'healthy'
> > > field.
> > > > > In accordance with our deprecation policy, I believe we could not
> > > remove
> > > > > the deprecated field until we have a new major release (2.x).
> > > > >
> > > > > I'd love to hear feedback on this proposal, thanks in advance! I'll
> > > also
> > > > > add this as an agenda item to our upcoming API working group
> meeting
> > on
> > > > > Tuesday, Oct. 16 at 11am PST.
> > > > >
> > > > > Cheers,
> > > > > Greg
> > > > >
> > > >
> > >
> >
>

Reply via email to