I agree we should offer to move the health checks to Felix. As this will be a breaking change anyway, I propose we take the chance to review the API and implementation and possibly make changes before a first release at Felix.
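To make that review concrete, here is a minimal sketch of the design points Georg argues for in the quoted mail: a single check interface without a getName() method, name and tags carried as properties ("hc.name", "hc.tags"), checks selected by tag, and only CRITICAL mapped to HTTP 503 (as in ?httpStatus=CRITICAL:503). All type and method names (HcSketch, Check, Registered, aggregate, httpStatus) are made up for illustration and are not the Sling or Felix API; a plain Map stands in for OSGi service properties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT the Sling/Felix health check API.
public class HcSketch {

    // Status levels ordered by severity (an assumed subset).
    public enum Status { OK, WARN, CRITICAL }

    // The single SPI: no getName(); name and tags live in properties.
    public interface Check {
        Status execute();
    }

    // A check plus its reconfigurable properties.
    // The Map is a stand-in for OSGi service properties.
    public static final class Registered {
        final Check check;
        final Map<String, Object> props;

        public Registered(Check check, Map<String, Object> props) {
            this.check = check;
            this.props = props;
        }

        boolean hasTag(String tag) {
            Object tags = props.get("hc.tags");
            return tags instanceof String[]
                    && Arrays.asList((String[]) tags).contains(tag);
        }
    }

    // Runs only the checks carrying the given tag and returns the worst status.
    public static Status aggregate(List<Registered> all, String tag) {
        Status worst = Status.OK;
        for (Registered r : all) {
            if (!r.hasTag(tag)) {
                continue;
            }
            Status s = r.check.execute();
            if (s.compareTo(worst) > 0) {
                worst = s;
            }
        }
        return worst;
    }

    // Assumed mapping in the spirit of "httpStatus=CRITICAL:503":
    // WARN keeps the instance online, only CRITICAL takes it offline.
    public static int httpStatus(Status s) {
        return s == Status.CRITICAL ? 503 : 200;
    }

    public static void main(String[] args) {
        List<Registered> checks = new ArrayList<>();
        checks.add(new Registered(() -> Status.WARN,
                Map.<String, Object>of("hc.name", "Slow disk",
                        "hc.tags", new String[] {"systemready"})));
        checks.add(new Registered(() -> Status.CRITICAL,
                Map.<String, Object>of("hc.name", "Unrelated",
                        "hc.tags", new String[] {"other"})));
        Status result = aggregate(checks, "systemready");
        System.out.println(result + " -> " + httpStatus(result));
    }
}
```

Because the name and tags are properties rather than methods, retagging a check for a new probe is a configuration change and needs no code change, which is the reconfigurability argument made in the thread.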
Christian

On Mon, 24 Sep 2018 at 18:05, Georg Henzler <slin...@ghenzler.de> wrote:
>
> Hi Andrei,
>
> > You saying
> > "Stefan's wording maybe wasn't perfect. But the agreement at the
> > Hackathon was to move Sling HC to Felix and merge useful things from
> > systemready in using Sling HCs as base."
>
> exactly. For the people that attended: Please comment if I'm wrong
> (Bertrand and Justin were not there, unfortunately). If you read Justin's
> second email in response to mine, you see that he agrees with my proposal.
>
> > 2. I'm not sure what your proposal is *exactly*, tbh, even by this point:
> >
> > "Please no bridge and no duplicate SPI interface!"
> > "only a temporary bridge with very simple impl and a deprecated SPI."
>
> exactly one SPI interface with the same OSGi properties and only changed
> package names. Migrating will be a matter of changing the imports and
> exchanging the maven dependency (the migration path needs to be easy, there
> might be around 5000 custom checks out there in the wild, just a rough
> estimate)
>
> the bridge (sling to felix HC) is minimal (as exactly the same checks are
> used) and temporary.
>
> > I assume you don't like a bridge but accept it will have to be there for
> > production systems.
>
> there is a difference between a bridge to support a deprecated interface
> and a bridge between two SPIs that are both not deprecated and just cause
> confusion
>
> > 3. I'm not sure we want to solve the same problems. I try to argue why I
> > did some things in a certain way (and not use HCs) and you're trying to
> > tell me "I don't think async is ideal for what you are doing at the moment"
> > and that I "should" use HCs in AEM.
> > I'd like to keep my flexibility and
> > room for a decision in my own work :)
>
> you have full flexibility when using the health check SPI interface, if
> you really find something that is missing we can add it (but I challenge
> you to find something ;-) )
>
> > Another thing I'd like to stress and put as simply as possible: If Sling
> > HCs move to felix, they are a NEW module, NOT compatible with Sling HCs.
>
> see above, simple migration path
>
> > ** Move Sling HCs to felix.
> > ** Since it has entirely new namespaces, adapt it and merge it with
> > systemready, but NOT everything.
>
> for the manual use case there is nothing special in sling (since it's
> trivial to implement, it's pretty much just the SPI interface and maybe the
> CompositeHealthCheck). Really, most of the code is there exactly to make the
> LB use case work smoothly.
>
> So we should move at least most of it.
>
> -Georg
>
> >> On Mon, Sep 24, 2018 at 3:42 PM Georg Henzler <slin...@ghenzler.de> wrote:
> >>
> >> Hi Christian, hi Andrei,
> >>
> >> after reading through the comments, the most important points (as a
> >> summary) first:
> >>
> >> * Health Checks are already used by many deployments for load balancers
> >> in order to not have to manually reconfigure LBs during
> >> production deployments (I will not post a list of blue chip companies on
> >> an open source mailing list though).
> >>
> >> * I sense agreement to take Health Checks to Felix, this is good :). HCs
> >> are a proven technology that covers the exact same use case as
> >> systemready and are more mature (having been around for 5 years).
> >>
> >> * HCs today are ready to be used with Kubernetes and ootb AEM, just
> >> configure the HC servlet [1], and define a tag (e.g. "systemready") by
> >> adding it to InactiveBundlesHealthCheck and to any other checks you need
> >> for this.
> >> When using a composite nodestore setup with Docker, just
> >> add the OSGi configs for the servlet and the configs for the "tag
> >> amendments" (using prop "hc.tags") to the provisioning model - done. To
> >> ensure you get a 5xx response just configure kubernetes probes [2] with
> >> http urls like /system/health/systemready.txt?httpStatus=CRITICAL:503
> >> (note that passing in query parameters for Kubernetes did not always
> >> work, but since 2016 it does [3])
> >>
> >> * We really have to make sure that we end up with exactly one SPI
> >> interface to provide checks. The current HC interface was discussed at
> >> length when we introduced it. There is a good reason why we don't have
> >> a getName() method and rather use the OSGi property "hc.name"
> >> (reconfigurability).
> >>
> >> * Having had a close look at systemready and knowing HCs very well
> >> (having written a fair share of the code), I absolutely think it is
> >> necessary to start with the health check as a base and merge in ideas
> >> from systemready (and not the other way round) - this was also Justin
> >> Edelson's initial response to this thread.
> >>
> >> I will answer all other questions below [4].
> >>
> >> Best Regards
> >> Georg
> >>
> >> [1] https://sling.apache.org/documentation/bundles/sling-health-check-tool.html#health-check-servlet
> >> [2] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
> >> [3] https://github.com/kubernetes/kubernetes/pull/25064
> >>
> >> [4]
> >>
> >>> Sling health checks support the concept of tags which allows to
> >>> configure the special meaning of readiness and liveness as tags. So I
> >>> think technically the HC framework should be able to cover our case
> >>> too.
> >>
> >> exactly
> >>
> >>> So I would like to extend the felix systemready project to learn from
> >>> sling hc and add some of the features there too.
> >>> I think the most
> >>> important things are tags and a solid model for executors. I would be
> >>> happy about any help with this from the sling community side.
> >>
> >> We really have to start with the Sling HC code and merge in systemready
> >> aspects
> >>
> >>> Another question is if we want to add felix systemready to the sling
> >>> distro at some point. Would the sling community be interested in this?
> >>
> >> yes, the felix health check should be added to the distro in the same
> >> way as the Sling HC is today
> >>
> >>> ... there are ootb healthchecks in AEM [9] and they are NOT used, to my
> >>> knowledge, for the load balancer use case.
> >>
> >> you rarely run all checks, you always run the checks for a particular
> >> tag you are interested in
> >>
> >>> sling HCs are used for the LB directing traffic... You say there are
> >>> many. Could you share some examples?
> >>
> >> Commonly used for production deployments (I work for an integration
> >> partner, we use this for all projects across all clients, but others
> >> use it too, as there were many talks at conferences about it)
> >>
> >>> I understand the optimization part, and while systemready might clearly
> >>> need some optimizations, I personally don't see it as a reasonable
> >>> concern. Kubernetes, for example, retries the liveness and readiness
> >>> checks a few times before deciding to act. Do 50ms actually matter
> >>> here?
> >>
> >> yes, 50ms matter (I created this issue after an operations department
> >> refused to use this because of its bad performance, since it is fixed
> >> they have been happy :)
> >>
> >>> parallel execution is not necessarily
> >>
> >> After 5 years of experience: parallel execution is absolutely
> >> necessary, otherwise response times get too long.
> >>
> >>> Async is not an issue
> >>
> >> I don't think async is ideal for what you are doing at the moment (the
> >> current systemready impl with default config possibly delays the correct
> >> result for 5 sec, this is not a good idea IMHO)
> >>
> >>>> There is no separate api bundle yet
> >>> True. Somebody needs to explain to me why that is a big deal for a very
> >>> small tool (I'm not that experienced with that matter)
> >>
> >> See SLING-6773
> >>
> >>> Yes, timeouts can make it easier to not _accidentally_ do something
> >>> bad, but then we're opening a new dimension of complexity. What happens
> >>> if a check times out?...
> >>
> >> We discussed this in detail some years ago, we have a good solution
> >> (WARN by default, CRITICAL after a configurable time). Note: In the HC
> >> world you don't take instances offline for WARN, only for CRITICAL.
> >>
> >>>> ...developers for the platform will be confused about the two SPI
> >>>> interfaces HealthCheck [7] and SystemReadyCheck [8], there will be
> >>>> many unnecessary discussions around when to use which one
> >>> I don't really agree about the reasoning. If we do make a bridge, they
> >>> are layered and we can keep options open.
> >>
> >> Please no bridge and no duplicate SPI interface! What option would you
> >> keep open? I cannot think of anything. Please note that the functional
> >> scope of systemready is fully covered by HCs. The AEM platform has
> >> suffered numerous times from the "too many options problem" - I work at
> >> a service provider and know exactly how much time is completely wasted by
> >> people discussing all these different options. Please note the problem is
> >> at scale: It will affect thousands of developers!
> >>
> >>> But I actually agree about moving them to Felix, for slightly different
> >>> reasons, which are exposure and decoupling.
> >>
> >> great :)
> >>
> >>> It's being used in AEM already (alpha, beta).
> >>
> >> I think you should try using ootb health checks as described at the top
> >> of this email.
> >>
> >>> I respectfully disagree about the KISS part - if anything systemready
> >>> is KISS - as simple as possible, disregarding limitations that don't
> >>> matter for the single usecase it covers. But I actually agree a bridge
> >>> per se is not an ideal solution.
> >>
> >> Bridges are not KISS but ugly (extra code, hard to
> >> understand/troubleshoot, extra bugs). As for systemready being KISS:
> >> yes it's easy, but it does not help being KISS while disregarding some
> >> important parts. HCs are KISS in the sense that they solve the problem in
> >> the easiest possible way (I believe).
> >>
> >>> But what Stefan was saying doesn't match what you're proposing and what
> >>> you're proposing is not part of the -decision- consensus you reached
> >>> during the hackathon. Or did I misunderstand?
> >>
> >> Stefan's wording maybe wasn't perfect. But the agreement at the
> >> Hackathon was to move Sling HC to Felix and merge useful things from
> >> systemready in using Sling HCs as base.
> >>
> >>> wouldn't it make more sense to have the Sling HCs codebase *extend*
> >>> systemready?
> >>
> >> This won't work. The health check executor is the heart of it (with all
> >> the handling we've discussed) and needs to be taken as base.
> >>
> >>> there will be a bridge already between what goes into felix and the
> >>> Sling HCs in sling
> >>
> >> only a temporary bridge with very simple impl and a deprecated SPI.
> >> Responsibility will be clearly moved to the felix health check module.
> >>
> >>> On 2018-09-24 12:05, Christian Schneider wrote:
> >>> I discussed with Stefan and Georg at adaptto about sling hc and felix
> >>> systemready.
> >>>
> >>> For me the main advantage of systemready being at felix is that it attracts
> >>> a lot more people / projects than a sling subproject.
> >>> People outside the sling community simply do not use parts of sling for
> >>> other purposes. One example of this is that Kai Kreuzer from Openhab
> >>> approached me to discuss how systemready could fit for openhab. We will
> >>> also discuss with Peter Kriens at Eclipsecon how the aggregate state
> >>> service overlaps with systemready. So I think actually sling hc would
> >>> have been a good case for bringing to felix from the start.
> >>>
> >>> So I would like to extend the felix systemready project to learn from
> >>> sling hc and add some of the features there too. I think the most
> >>> important things are tags and a solid model for executors. I would be
> >>> happy about any help with this from the sling community side.
> >>>
> >>> As some people already use sling hc with load balancers I think it also
> >>> makes sense to allow reusing sling health checks in systemready.
> >>>
> >>> Another question is if we want to add felix systemready to the sling
> >>> distro at some point. Would the sling community be interested in this?
> >>>
> >>> Christian
> >>>
> >>> On Thu, 13 Sep 2018 at 19:03, Stefan Seifert <sseif...@pro-vision.de> wrote:
> >>>
> >>>> - currently there is some overlap between sling health checks and the
> >>>> new felix system readiness framework presented [1]
> >>>> - the idea is to bring this together within felix
> >>>> - provide a facade for the sling healthcheck API for backwards
> >>>> compatibility
> >>>>
> >>>> stefan
> >>>>
> >>>> [1] https://adapt.to/2018/en/schedule/system-readiness-framework-makes-deployment-automation-a-breeze.html

--
Christian Schneider
http://www.liquid-reality.de

Computer Scientist
http://www.adobe.com