Re: [hackathon] health checks

Andrei Dulvac Mon, 24 Sep 2018 07:16:46 -0700

Hi Georg.

Please don't take this the wrong way, but I see a few issues with this
conversation:


1. I think it's easy to misinterpret what other people think and easy to
think two or twelve people agree on exactly all the aspects.

You saying
"Stefan's wording maybe wasn't perfect. But the agreement at the
Hackathon was to move Sling HC to Felix and merge useful things from
systemready in using Sling HCs as base."

confuses me as I clearly got a different message from other participants on
this thread:

Justin:

"In other words, I'm doubtful that there is an overlap here
at a framework level. What would make sense is a bridge where a subset of
health checks could be fed into the readyness framework "

Bertrand:

"I think systemready and HCs have clearly different purposes and scope,
but bridges like this (and maybe also the opposite, exposing
systemready checks as HCs) make sense."

I don't know what you discussed in the hackathon, but I'd like to see a
discussion with the interested parties here, on the list, with everyone
speaking for themselves.

What I understood from Stefan was JUST that we should try to bring them
together. And I agree!

2. I'm not sure what your proposal is *exactly*, tbh, even by this point:

"Please no bridge and no duplicate SPI interface!"
"only a temporary brid[g]e with very simple impl and a deprecated SPI. "

I assume you don't like a bridge but accept it will have to be there for
production systems, and for a very long time if it's used by "many
production systems". So for a while there will be a bridge regardless.

3. I'm not sure we want to solve the same problems. I try to argue why I
did some things in a certain way (and not use HCs) and you're trying to
tell me "I don't think async is ideal for what you are doing at the moment"
and that I "should" use HCs in AEM. I'd like to keep my flexibility and
room for a decision in my own work :)

Another thing I'd like to stress and put as simply as possible: If Sling
HCs move to felix, they are a NEW module, NOT compatible with Sling HCs.
Which, since it's not backward-compatible, allows it to transform or merge
with another existing tool, like systemready. Did I get something wrong
here?

Bottom line for me:
* Let's have this discussion here, everyone stating their own opinions and
facts. You have a head start after the f2f discussions; I need at least a
condensed version of that conversation.
* My preference is to:
   ** Move Sling HCs to felix.
   ** Since it has entirely new namespaces, adapt it and merge it with
systemready, but NOT everything. Keep the things that are outside the
scope of systemready either in the Sling layer or in an extension (a
bundle depending on systemready) in felix.
   ** Since we need to keep the sling interface as is, unfortunately
maintain a bridge until we decide to break compatibility with the
production systems using it.
* I (hope that I) am listening to the arguments and I am looking forward to
bringing those two together.

Yours,
- Andrei

P.S. Sorry for top-posting. It was getting too long and a bit circular.
Feel free to revert to in-line quoting if you guys prefer it.



On Mon, Sep 24, 2018 at 3:42 PM Georg Henzler <slin...@ghenzler.de> wrote:

> Hi Christian, hi Andrei,
>
> after reading through the comments, the most important points (as a
> summary) first:
>
> * Health Checks are already used by many deployments for load balancers
> in order to not have to manually reconfigure LBs during production
> deployments (I will not post a list of blue chip companies on an open
> source mailing list though).
>
> * I sense agreement to take Health Checks to Felix, this is good :). HCs
> are a proven technology that covers the exact same use case as
> systemready and is more mature (having been around for 5 years).
>
> * HCs today are ready to be used with Kubernetes and ootb AEM: just
> configure the HC servlet [1] and define a tag (e.g. "systemready") by
> adding it to InactiveBundlesHealthCheck and to any other checks you
> need for this. When using a composite nodestore setup with Docker, just
> add the OSGi configs for the servlet and the configs for the "tag
> amendments" (using prop "hc.tags") to the provisioning model - done. To
> ensure you get a 5xx response, just configure Kubernetes probes [2] with
> HTTP URLs like /system/health/systemready.txt?httpStatus=CRITICAL:503
> (note that passing in query parameters for Kubernetes did not always
> work, but since 2016 it does [3]).
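>
> A minimal sketch of such a readiness probe, assuming the HC servlet is
> reachable under /system/health and that checks are tagged "systemready"
> as described. The port and the timing values are illustrative
> assumptions, not recommendations; adjust them to your setup:
>
> ```yaml
> readinessProbe:
>   httpGet:
>     # Runs all checks tagged "systemready"; the servlet answers 503
>     # when the aggregated result is CRITICAL.
>     path: /system/health/systemready.txt?httpStatus=CRITICAL:503
>     port: 8080            # assumed HTTP port of the instance
>   periodSeconds: 10       # assumed polling interval
>   failureThreshold: 3     # assumed number of failures before "not ready"
> ```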
>
> * We really have to make sure that we end up with exactly one SPI
> interface to provide checks. The current HC interface was discussed at
> length when we introduced it. There is a good reason why we don't have
> a getName() method and rather use OSGi property "hc.name"
> (reconfigurability).
>
> * Having had a close look at systemready and knowing HCs very well
> (having written a fair share of the code), I absolutely think it is
> necessary to start with the health check as a base and merge in ideas
> from systemready (and not the other way round) - this was also Justin
> Edelson's initial response to this thread.
>
> I will answer all other questions below [4].
>
> Best Regards
> Georg
>
> [1]
>
> https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet
> [2]
>
> https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
> [3] https://github.com/kubernetes/kubernetes/pull/25064
>
>
> [4]
>
> > Sling health checks support the concept of tags which allows to
> > configure the special meaning of readiness and liveness as tags. So I
> > think technically the HC framework should be able to cover our case
> > too.
>
> exactly
>
> > So I would like to extend to felix systemready project to learn from
> > sling hc and add some of the features there too. I think the most
> > important thing are tags and a solid model for executors. I would be
> > happy about any help with this from the sling community side.
>
> We really have to start with the Sling HC code and merge in system ready
> aspects
>
> > Another question is if we want to add felix systemready to the sling
> > distro at some point. Would the sling community be interested in this?
>
> yes, the felix health check should be added to the distro in the same
> way as the Sling HC is today
>
> > ... there are ootb healthchecks in AEM [9] and they are NOT used, to my
> > knowledge, for the load balancer use case.
>
> you rarely run all checks, you always run the checks for a particular
> tag you are interested in
>
> >  sling HCs are used for the LB directing traffic... You say there are
> > many. Could you share some examples?
>
> Commonly used for production deployments (I work for an integration
> partner, we use this for all projects across all clients, but others
> use it too, as there were many talks at conferences about it)
>
> > I understand the optimization part, and while systemready might clearly
> > need some optimizations, I personally don't see it as a reasonable
> > concern. Kubernetes, for example, retries the liveness and readiness
> > checks a few times before deciding to act. Do 50ms actually matter
> > here?
>
> yes, 50ms matter (I created this issue after an operations department
> refused to use this because of its bad performance; since it was fixed
> they have been happy :)
>
> > parallel execution is not necessarily
>
> From my 5 years of experience: parallel execution is absolutely
> necessary, otherwise response times get too long.
>
> > Async is not an issue
>
> I don't think async is ideal for what you are doing at the moment (the
> current systemready impl with default config possibly delays the correct
> result for 5 sec, this is not a good idea IMHO)
>
> >> There is no separate api bundle yet
> > True. Somebody needs to explain to me why that is a big deal for a very
> > small tool (I'm not that experienced with that matter)
>
> See SLING-6773
>
> > Yes, timeouts can make it easier to not _accidentally_ do something
> > bad, but then we're opening a new dimension of complexity. What happens
> > if a check times out?...
>
> We have discussed this in detail some years ago, we have a good solution
> (WARN by default, CRITICAL after a configurable time). Note: In the HC
> world you don't take instances offline for WARN, only for CRITICAL.
>
> >> ...developers for the platform will be confused about the two SPI
> >> interfaces HealthCheck [7] and SystemReadyCheck [8], there will be
> >> many unnecessary discussions around when to use which one
> > I don't really agree about the reasoning. If we do make a bridge, they
> > are layered and we can keep options open.
>
> Please no bridge and no duplicate SPI interface! What option would you
> keep open? I cannot think of anything. Please note that the functional
> scope of systemready is fully covered by HCs. The AEM platform has
> suffered numerous times from the "too many options problem" - I work at
> a service provider and know exactly how much time is completely wasted
> by people discussing all these different options. Please note the
> problem is at scale: It will affect thousands of developers!
>
> > But I actually agree about moving them to Felix, for slightly different
> > reasons, which is exposure and decoupling.
>
> great :)
>
> > It's being used in AEM already (alpha, beta).
>
> I think you should try using ootb health checks as described at the top
> of this email.
>
> > I respectfully disagree about the KISS part - if anything systemready
> > is KISS - as simple as possible, disregarding limitations that don't
> > matter for the single usecase it covers. But I actually agree a bridge
> > per se is not an ideal solution.
>
> Bridges are not KISS but ugly (extra code, extra bugs, hard to
> understand/troubleshoot). For systemready being KISS: yes it's easy,
> but being KISS does not help if it disregards some important parts.
> HCs are KISS in a way that they solve the problem in the easiest
> possible way (I believe).
>
> > But what Stefan was saying doesn't match what you're proposing and what
> > you're proposing is not part of the -decision- consensus you reached
> > during the hackathon. Or did I misunderstand?
>
> Stefan's wording maybe wasn't perfect. But the agreement at the
> Hackathon was to move Sling HC to Felix and merge useful things from
> systemready in using Sling HCs as base.
>
> > wouldn't it make more sense to have the Sling HCs codebase *extend*
> > systemready?
>
> This won't work. The health check executor is the heart of it (with all
> the handling we've discussed) and needs to be taken as base.
>
> > there will be a bridge already between what goes into felix and the
> > Sling HCs in sling
>
> only a temporary bridge with very simple impl and a deprecated SPI.
> Responsibility will be clearly moved to the felix health check module.
>
>
>
> On 2018-09-24 12:05, Christian Schneider wrote:
> > I discussed with Stefan and Georg at adaptto about sling hc and felix
> > systemready.
> >
>
> >
> > For me the main advantage of systemready being at felix is that it
> > attracts a lot more people / projects than a sling subproject. People
> > outside the sling community simply do not use parts of sling for other
> > purposes. One example of this is that Kai Kreuzer from Openhab
> > approached me to discuss how systemready could fit for openhab. We will
> > also discuss with Peter Kriens at Eclipsecon how the aggregate state
> > service overlaps with systemready. So I think actually sling hc would
> > have been a good case for bringing to felix from the start.
> >
> > So I would like to extend to felix systemready project to learn from
> > sling hc and add some of the features there too. I think the most
> > important thing are tags and a solid model for executors. I would be
> > happy about any help with this from the sling community side.
> >
> > As some people already use sling hc with load balancers I think it also
> > makes sense to allow to reuse sling health checks in system ready.
> >
> > Another question is if we want to add felix systemready to the sling
> > distro at some point. Would the sling community be interested in this?
> >
> > Christian
> >
> >
> > On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <
> > sseif...@pro-vision.de> wrote:
> >
> >> - currently there is some overlap between sling health checks and the
> >> new felix system readyness framework presented [1]
> >> - the idea is to bring this together within felix
> >> - provide a facade for the sling healthcheck API for backwards
> >> compatibility
> >>
> >> stefan
> >>
> >> [1]
> >> https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
> >>
> >>
> >>
> >
> > --
>
