Re: [hackathon] health checks

Christian Schneider Thu, 27 Sep 2018 06:37:37 -0700

I agree we should offer to move health checks to Felix.
As this will be a breaking change anyway, I propose we take the chance to
review the API and implementation and possibly make changes before a first
release at Felix.


Christian

On Mon, 24 Sep 2018 at 18:05, Georg Henzler <
[email protected]>:

>
> Hi Andrei,
>
> > You're saying
> > "Stefan's wording maybe wasn't perfect. But the agreement at the
> > Hackathon was to move Sling HC to Felix and merge useful things from
> > systemready in using Sling HCs as base."
>
> exactly. For the people that attended: please comment if I'm wrong
> (Bertrand and Justin were unfortunately not there). If you read Justin's
> second email in reply to my response, you will see that he agrees with my
> proposal.
>
> > 2. I'm not sure what your proposal is *exactly*, tbh, even by this point:
> >
> > "Please no bridge and no duplicate SPI interface!"
> > "only a temporary brid[g]e with very simple impl and a deprecated SPI. "
>
> exactly one SPI interface with the same OSGi properties and only changed
> package names. Migrating will be a matter of changing the imports and
> exchanging the Maven dependency (the migration path needs to be easy; there
> might be around 5000 custom checks out there in the wild, just a rough
> estimate)
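
A minimal sketch of the "one SPI interface, name and tags as OSGi properties" idea. The `HealthCheck`, `CheckStatus`, `Registration` and `TagSelector` types are simplified stand-ins written for this illustration, not the real Sling or Felix classes, and `Registration` only imitates an OSGi service registration. Only the property names `hc.name` and `hc.tags` are taken from this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified status type (the real API is richer).
enum CheckStatus { OK, WARN, CRITICAL }

// Stand-in for the single SPI interface: no getName(), only execute().
interface HealthCheck {
    CheckStatus execute();
}

// Imitates an OSGi service registration: the check plus its service properties.
final class Registration {
    final HealthCheck check;
    final Map<String, Object> props;

    Registration(HealthCheck check, Map<String, Object> props) {
        this.check = check;
        this.props = props;
    }

    // The name comes from configuration, not from the check's code.
    String name() {
        return (String) props.get("hc.name");
    }

    boolean hasTag(String tag) {
        String[] tags = (String[]) props.get("hc.tags");
        return tags != null && Arrays.asList(tags).contains(tag);
    }
}

final class TagSelector {
    // Returns the names of all registered checks that carry the given tag.
    static List<String> namesForTag(List<Registration> regs, String tag) {
        List<String> names = new ArrayList<>();
        for (Registration r : regs) {
            if (r.hasTag(tag)) {
                names.add(r.name());
            }
        }
        return names;
    }
}
```

Because the check itself knows neither its name nor its tags in this sketch, moving it to another package would only touch the import of the interface, which is the kind of migration described above.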
>
> the bridge (Sling to Felix HC) is minimal (as exactly the same checks are
> used) and temporary.
>
> > I assume you don't like a bridge but accept it will have to be there for
> > production systems.
>
> there is a difference between a bridge to support a deprecated interface
> and a bridge between two SPIs that are both not deprecated and just cause
> confusion
>
> >
> > 3. I'm not sure we want to solve the same problems. I try to argue why I
> > did some things in a certain way (and not use HCs) and you're trying to
> > tell me "I don't think async is ideal for what you are doing at the
> > moment" and that I "should" use HCs in AEM. I'd like to keep my
> > flexibility and room for a decision in my own work :)
> >
>
> you have full flexibility when using the health check SPI interface, if
> you really find something that is missing we can add it (but I challenge
> you to find something ;-) )
>
> > Another thing I'd like to stress and put as simply as possible: If Sling
> > HCs move to felix, they are a NEW module, NOT compatible with Sling HCs.
>
> see above, simple migration path
>
> >
> >
> >   ** Move Sling HCs to felix.
> >   ** Since it has entirely new namespaces, adapt it and merge it with
> > systemready, but NOT everything.
>
> for the manual use case there is nothing special in sling (since it’s
> trivial to implement, it’s pretty much just the SPI interface and maybe the
> CompositeHealthCheck). Really most code is exactly to make the LB use case
> work smoothly.
>
> So we should move at least most of it.
>
> -Georg
>
>
> >
> >
> >> On Mon, Sep 24, 2018 at 3:42 PM Georg Henzler <[email protected]> wrote:
> >>
> >> Hi Christian, hi Andrei,
> >>
> >> after reading through the comments, the most important points (as a
> >> summary) first:
> >>
> >> * Health Checks are already used by many deployments for load balancers
> >> so that LBs do not have to be reconfigured manually during
> >> production deployments (I will not post a list of blue chip companies on
> >> an open source mailing list though).
> >>
> >> * I sense agreement to take Health Checks to Felix, this is good :). HCs
> >> are a proven technology that covers the exact same use case as
> >> systemready and are more mature (having been around for 5 years).
> >>
> >> * HCs today are ready to be used with Kubernetes and ootb AEM: just
> >> configure the HC servlet [1] and define a tag (e.g. "systemready") by
> >> adding it to InactiveBundlesHealthCheck and any other checks you need
> >> for this. When using a composite nodestore setup with Docker, just
> >> add the OSGi configs for the servlet and the configs for the "tag
> >> amendments" (using prop "hc.tags") to the provisioning model - done. To
> >> ensure you get a 5xx response, just configure Kubernetes probes [2] with
> >> HTTP URLs like /system/health/systemready.txt?httpStatus=CRITICAL:503
> >> (note that passing in query parameters for Kubernetes did not always
> >> work, but since 2016 it does [3])
> >>
> >> * We really have to make sure that we end up with exactly one SPI
> >> interface to provide checks. The current HC interface was discussed at
> >> length when we introduced it. There is a good reason why we don't have
> >> a getName() method and rather use the OSGi property "hc.name"
> >> (reconfigurability).
> >>
> >> * Having had a close look at systemready and knowing HCs very well
> >> (having written a fair share of the code), I absolutely think it is
> >> necessary to start with the health check as a base and merge in ideas
> >> from systemready (and not the other way round) - this was also Justin
> >> Edelson's initial response to this thread.
> >>
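
To make the `httpStatus=CRITICAL:503` parameter from the probe URL above more concrete, here is a rough sketch of how such a specification could be turned into HTTP response codes. `HcStatus` and `HttpStatusMapping` are hypothetical names, and the carry-over rule (a code applies to its status and to every more severe status until another entry overrides it, with 200 as the default) is an assumption that should be checked against the servlet documentation [1].

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical status levels, ordered from least to most severe.
enum HcStatus { OK, WARN, CRITICAL, HEALTH_CHECK_ERROR }

final class HttpStatusMapping {
    // Parses a spec such as "CRITICAL:503" or "WARN:418,CRITICAL:503".
    static Map<HcStatus, Integer> parse(String spec) {
        Map<HcStatus, Integer> explicit = new EnumMap<>(HcStatus.class);
        for (String part : spec.split(",")) {
            String[] kv = part.trim().split(":");
            explicit.put(HcStatus.valueOf(kv[0]), Integer.parseInt(kv[1]));
        }
        // Assumption: each code also applies to all more severe statuses
        // until another explicit entry overrides it; the default is 200.
        Map<HcStatus, Integer> resolved = new EnumMap<>(HcStatus.class);
        int current = 200;
        for (HcStatus s : HcStatus.values()) {
            if (explicit.containsKey(s)) {
                current = explicit.get(s);
            }
            resolved.put(s, current);
        }
        return resolved;
    }
}
```

Under these assumptions a WARN result still answers with 200, so a readiness probe would only see a failure once the aggregated result reaches CRITICAL.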
> >> I will answer all other questions below [4].
> >>
> >> Best Regards
> >> Georg
> >>
> >> [1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet
> >> [2] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
> >> [3] https://github.com/kubernetes/kubernetes/pull/25064
> >>
> >>
> >> [4]
> >>
> >>> Sling health checks support the concept of tags which allows to
> >>> configure the special meaning of readiness and liveness as tags. So I
> >>> think technically the HC framework should be able to cover our case
> >>> too.
> >>
> >> exactly
> >>
> >>> So I would like to extend the felix systemready project to learn from
> >>> sling hc and add some of the features there too. I think the most
> >>> important things are tags and a solid model for executors. I would be
> >>> happy about any help with this from the sling community side.
> >>
> >> We really have to start with the Sling HC code and merge in systemready
> >> aspects
> >>
> >>> Another question is if we want to add felix systemready to the sling
> >>> distro at some point. Would the sling community be interested in this?
> >>
> >> yes, the felix health check should be added to the distro in the same
> >> way as the Sling HC is today
> >>
> >>> ... there are ootb healthchecks in AEM [9] and they are NOT used, to my
> >>> knowledge, for the load balancer use case.
> >>
> >> you rarely run all checks, you always run the checks for a particular
> >> tag you are interested in
> >>
> >>> sling HCs are used for the LB directing traffic... You say there are
> >>> many. Could you share some examples?
> >>
> >> Commonly used for production deployments (I work for an integration
> >> partner; we use this for all projects across all clients, and others
> >> use it too, as there have been many talks at conferences about it)
> >>
> >>> I understand the optimization part, and while systemready might clearly
> >>> need some optimizations, I personally don't see it as a reasonable
> >>> concern. Kubernetes, for example, retries the liveness and readiness
> >>> checks a few times before deciding to act. Do 50ms actually matter
> >>> here?
> >>
> >> yes, 50ms matter (I created this issue after an operations department
> >> refused to use this because of its bad performance; since it was fixed
> >> they have been happy :)
> >>
> >>> parallel execution is not necessarily
> >>
> >> From my 5 years of experience: parallel execution is absolutely
> >> necessary, otherwise response times get too long.
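
A minimal sketch of what parallel execution of checks can look like, using only the JDK's `ExecutorService`. `Severity` and `ParallelRunner` are hypothetical names for this illustration and do not reflect the actual health check executor. Treating a check that throws as CRITICAL is likewise an assumption.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical status levels, ordered from least to most severe.
enum Severity { OK, WARN, CRITICAL }

final class ParallelRunner {
    // Runs all checks concurrently and returns the most severe status.
    static Severity runAll(List<Callable<Severity>> checks) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, checks.size()));
        try {
            Severity worst = Severity.OK;
            // invokeAll blocks until every check has completed.
            for (Future<Severity> f : pool.invokeAll(checks)) {
                Severity s = f.get();
                if (s.compareTo(worst) > 0) {
                    worst = s;
                }
            }
            return worst;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Severity.CRITICAL;
        } catch (ExecutionException e) {
            // Assumption: a check that throws counts as a failed check.
            return Severity.CRITICAL;
        } finally {
            pool.shutdown();
        }
    }
}
```

With one thread per check, the overall response time is roughly that of the slowest check rather than the sum of all checks, which is the effect argued for above.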
> >>
> >>> Async is not an issue
> >>
> >> I don't think async is ideal for what you are doing at the moment (the
> >> current systemready impl with default config possibly delays the correct
> >> result for 5 sec, this is not a good idea IMHO)
> >>
> >>>> There is no separate api bundle yet
> >>> True. Somebody needs to explain to me why that is a big deal for a very
> >>> small tool (I'm not that experienced with that matter)
> >>
> >> See SLING-6773
> >>
> >>> Yes, timeouts can make it easier to not _accidentally_ do something
> >>> bad, but then we're opening a new dimension of complexity. What happens
> >>> if a check times out?...
> >>
> >> We have discussed this in detail some years ago, we have a good solution
> >> (WARN by default, CRITICAL after a configurable time). Note: In the HC
> >> world you don't take instances offline for WARN, only for CRITICAL.
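
The timeout rule described here (WARN by default, CRITICAL after a configurable time, and instances only taken offline for CRITICAL) can be sketched as follows. `TimeoutPolicy`, `TimeoutStatus` and both threshold parameters are hypothetical names, and the values used with them are purely illustrative.

```java
// Hypothetical status levels; OK here means "no timeout penalty",
// in which case the check's own result would apply.
enum TimeoutStatus { OK, WARN, CRITICAL }

final class TimeoutPolicy {
    final long warnAfterMs;      // hypothetical: regular timeout of a check
    final long criticalAfterMs;  // hypothetical: configurable longer limit

    TimeoutPolicy(long warnAfterMs, long criticalAfterMs) {
        this.warnAfterMs = warnAfterMs;
        this.criticalAfterMs = criticalAfterMs;
    }

    // Maps how long a check has been running to the status it reports.
    TimeoutStatus statusFor(long runningForMs) {
        if (runningForMs >= criticalAfterMs) {
            return TimeoutStatus.CRITICAL;
        }
        if (runningForMs >= warnAfterMs) {
            return TimeoutStatus.WARN;
        }
        return TimeoutStatus.OK;
    }

    // Only CRITICAL takes an instance out of the load balancer.
    static boolean takeOffline(TimeoutStatus status) {
        return status == TimeoutStatus.CRITICAL;
    }
}
```

A check that is merely slow therefore degrades the result to WARN without removing the instance from rotation. Only a check that stays stuck beyond the longer limit has that effect.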
> >>
> >>>> ...developers for the platform will be confused about the two SPI
> >>>> interfaces HealthCheck [7] and SystemReadyCheck [8], there will be
> >>>> many unnecessary discussions around when to use which one
> >>> I don't really agree about the reasoning. If we do make a bridge, they
> >>> are layered and we can keep options open.
> >>
> >> Please no bridge and no duplicate SPI interface! What option would you
> >> keep open? I cannot think of anything. Please note that the functional
> >> scope of systemready is fully covered by HCs. The AEM platform has
> >> suffered numerous times from the "too many options problem" - I work at a
> >> service provider and know exactly how much time is completely wasted by
> >> people discussing all these different options. Please note the problem is
> >> at scale: It will affect thousands of developers!
> >>
> >>> But I actually agree about moving them to Felix, for slightly different
> >>> reasons, which is exposure and decoupling.
> >>
> >> great :)
> >>
> >>> It's being used in AEM already (alpha, beta).
> >>
> >> I think you should try using ootb health checks as described at the top
> >> of this email.
> >>
> >>> I respectfully disagree about the KISS part - if anything systemready
> >>> is KISS - as simple as possible, disregarding limitations that don't
> >>> matter for the single usecase it covers. But I actually agree a bridge
> >>> per se is not an ideal solution.
> >>
> >> Bridges are not KISS but ugly (extra code, hard to
> >> understand/troubleshoot, extra bugs). For systemready being KISS:
> >> yes it's easy, but being KISS does not help if it disregards some
> >> important parts. HCs are KISS in a way that they solve the problem in
> >> the easiest possible way (I believe).
> >>
> >>> But what Stefan was saying doesn't match what you're proposing and what
> >>> you're proposing is not part of the -decision- consensus you reached
> >>> during the hackathon. Or did I misunderstand?
> >>
> >> Stefan's wording maybe wasn't perfect. But the agreement at the
> >> Hackathon was to move Sling HC to Felix and merge useful things from
> >> systemready in using Sling HCs as base.
> >>
> >>> wouldn't it make more sense to have the Sling HCs codebase *extend*
> >>> systemready?
> >>
> >> This won't work. The health check executor is the heart of it (with all
> >> the handling we've discussed) and needs to be taken as base.
> >>
> >>> there will be a bridge already between what goes into felix and the
> >>> Sling HCs in sling
> >>
> >> only a temporary bridge with very simple impl and a deprecated SPI.
> >> Responsibility will be clearly moved to the felix health check module.
> >>
> >>
> >>
> >>> On 2018-09-24 12:05, Christian Schneider wrote:
> >>> I discussed with Stefan and Georg at adaptto about sling hc and felix
> >>> systemready.
> >>>
> >>
> >>>
> >>> For me the main advantage of systemready being at felix is that it
> >>> attracts a lot more people / projects than a sling subproject. People
> >>> outside the sling community simply do not use parts of sling for other
> >>> purposes. One example of this is that Kai Kreuzer from Openhab
> >>> approached me to discuss how systemready could fit for openhab. We will
> >>> also discuss with Peter Kriens at Eclipsecon how the aggregate state
> >>> service overlaps with systemready. So I think actually sling hc would
> >>> have been a good case for bringing to felix from the start.
> >>>
> >>> So I would like to extend the felix systemready project to learn from
> >>> sling hc and add some of the features there too. I think the most
> >>> important things are tags and a solid model for executors. I would be
> >>> happy about any help with this from the sling community side.
> >>>
> >>> As some people already use sling hc with load balancers I think it also
> >>> makes sense to allow to reuse sling health checks in system ready.
> >>>
> >>> Another question is if we want to add felix systemready to the sling
> >>> distro at some point. Would the sling community be interested in this?
> >>>
> >>> Christian
> >>>
> >>>
> >>> On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <
> >>> [email protected]>:
> >>>
> >>>> - currently there is some overlap between sling health checks and the
> >>>> new felix system readiness framework presented [1]
> >>>> - the idea is to bring this together within felix
> >>>> - provide a facade for the sling healthcheck API for backwards
> >>>> compatibility
> >>>>
> >>>> stefan
> >>>>
> >>>> [1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html
> >>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>
>
>

-- 
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com
