Re: [hackathon] health checks

Andrei Dulvac Fri, 28 Sep 2018 11:29:36 -0700

Hi Jörg.

This is where systemready is a bit different:
* We have a disable config on our checks. This can definitely be improved,
and maybe it belongs in the monitor, since currently it's just the way we
implemented the individual checks.
* WARN is indeed confusing for the LB case: is the instance ready/alive
or not? That's why we went for GREEN/YELLOW/RED. So for us, WARN maps to
YELLOW but the naming makes the difference clearer: YELLOW is "not ready
yet but it's a matter of time" and RED is "yeah, this isn't going to be
ready without manual intervention". "WARN" _could_ mean that, but that's
usually not what it means, at least not in any tool I've seen.
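
To make the mapping above concrete, here is a minimal sketch in Java. It is
not the systemready or Sling health check API: the class ReadinessSketch and
the methods fromLegacy, aggregate, isReady and needsIntervention are
hypothetical names made up for illustration, and treating only GREEN as
"ready" for the LB is my assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the systemready or Sling HC API.
public final class ReadinessSketch {

    /** Tri-state result, declared from best to worst so compareTo() orders by severity. */
    enum State { GREEN, YELLOW, RED }

    /** Maps the OK/WARN/CRITICAL names onto GREEN/YELLOW/RED (assumed mapping). */
    static State fromLegacy(String legacy) {
        switch (legacy) {
            case "OK": return State.GREEN;
            case "WARN": return State.YELLOW;
            case "CRITICAL": return State.RED;
            default: throw new IllegalArgumentException("Unknown status: " + legacy);
        }
    }

    /** Returns the worst state among checks that are not disabled; GREEN if none remain. */
    static State aggregate(Map<String, State> results, Set<String> disabled) {
        State worst = State.GREEN;
        for (Map.Entry<String, State> e : results.entrySet()) {
            if (disabled.contains(e.getKey())) {
                continue; // a disabled check does not take part in the calculation
            }
            if (e.getValue().compareTo(worst) > 0) {
                worst = e.getValue();
            }
        }
        return worst;
    }

    /** Assumption: only GREEN means the instance should receive traffic. */
    static boolean isReady(State s) {
        return s == State.GREEN;
    }

    /** RED is the only state that will not resolve without manual intervention. */
    static boolean needsIntervention(State s) {
        return s == State.RED;
    }

    public static void main(String[] args) {
        Map<String, State> results = new LinkedHashMap<>();
        results.put("bundles", fromLegacy("OK"));
        results.put("repository", fromLegacy("WARN"));
        results.put("flaky", fromLegacy("CRITICAL"));

        // The misbehaving "flaky" check is disabled, so it is ignored.
        State overall = aggregate(results, Set.of("flaky"));
        System.out.println(overall + " ready=" + isReady(overall));
    }
}
```

With the "flaky" check disabled, this prints "YELLOW ready=false": the LB
gets an unambiguous "not ready", while the state itself still says whether
waiting is enough or someone has to step in.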


The monitoring part is something I think needs to be handled carefully:
Yes, we can feed this into a monitoring tool, but I would not make the HCs,
systemready, or whatever comes of the two a tool for providing
quantitative data, just values for binary (ternary, I guess) metrics
(qualitative info).

I agree with Christian that this might be the best opportunity to review
some of the design choices and (personal preference alert!) maybe split it
into modules with slightly different concerns. We're going to have the
Sling mapping to the new SPI in Felix anyway, so we keep backward
compatibility.

Yes, there's a tradeoff, but let's talk about it.

- Andrei

On Fri, Sep 28, 2018 at 7:40 PM Jörg Hoh <[email protected]>
wrote:

> I don't want to revive this discussion, but I just wanted to share some of
> the ideas I had when I initially started this with Bertrand (as it happens,
> we did that together at an adaptTo() some years ago).
>
> * the idea was always to use the healthchecks to capture the application
> state and make it usable for consumption by a loadbalancer or any other
> external monitoring system. Implementing other checks (like checks for
> security measures being in place) is also possible, but that has never
> been the primary use case.
> * how the reporting states "OK", "WARN" and "CRITICAL" are
> interpreted is totally up to the developer implementing the healthchecks
> and the team operating the system. While "OK" and "CRITICAL" seem quite
> natural, "WARN" is always ambiguous. Coming from an "old-school" IT
> operations background, I would interpret WARN as "it's still working and
> can be used, but we should have a closer look at it". Nevertheless, the
> developer of the healthchecks should have the same understanding.
> * The idea was always that it should be possible to change the settings
> manually at runtime, either to override accidentally incorrect settings
> or to handle unforeseen situations; removing a misbehaving check from the
> state calculation (manually, without a deployment) is definitely a use case
> that should be supported.
>
> Jörg
>
> Am Do., 13. Sep. 2018 um 19:03 Uhr schrieb Stefan Seifert <
> [email protected]>:
>
> > - currently there is some overlap between sling health checks and the new
> > felix system readiness framework presented in [1]
> > - the idea is to bring this together within felix
> > - provide a facade for the sling healthcheck API for backwards
> > compatibility
> >
> > stefan
> >
> > [1]
> >
> https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
> >
> >
> >
>
> --
> Cheers,
> Jörg Hoh,
>
> http://cqdump.wordpress.com
> Twitter: @joerghoh
>
