Joerg Hoh created FELIX-6663:
--------------------------------

             Summary: Warn if healthcheck execution takes too long
                 Key: FELIX-6663
                 URL: https://issues.apache.org/jira/browse/FELIX-6663
             Project: Felix
          Issue Type: Task
          Components: Health Checks
    Affects Versions: healthcheck.core 2.2.0
            Reporter: Joerg Hoh


We monitor our system using Felix Healthchecks and require that some 
healthchecks are reported OK at least every 5 seconds. For this we configured 
the timeout in the HealthCheckOptions to 5 seconds.

But on rare occasions the system becomes unhealthy without any 
healthcheck having been executed. It even seems that none of the required 
healthchecks is executed at all during that time.

I already ruled out a few obvious causes (full GC, maxed out CPU), but I still 
have a few cases which I cannot explain yet. Also, while checking the code, I 
found that on every invocation of HealthcheckExecutor.execute() all 
metadata for the healthchecks is collected, which requires access to the OSGi 
service registry. My application also has situations where the service registry 
is accessed heavily, which can lead to lock contention under load, 
and the time spent there is not included in the timeout calculation of the 
healthchecks.

As a first step I would like to add some more logging in case the overall 
execution of the healthchecks exceeds the configured timeout.
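
To illustrate the idea, here is a minimal sketch of such a check. It is not 
the Felix implementation: the class TimedExecution, its Timed result type and 
the run() method are hypothetical names made up for this example. The sketch 
assumes the timer is started before the whole call (so that time spent on 
metadata collection and service registry access is counted too) and that the 
elapsed time is compared with the configured timeout afterwards.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper, not part of Felix Health Checks.
public class TimedExecution {

    /** Outcome of a timed call; warning is null if the call stayed within the timeout. */
    public static final class Timed<T> {
        public final T value;
        public final long elapsedMillis;
        public final String warning;

        Timed(T value, long elapsedMillis, String warning) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
            this.warning = warning;
        }
    }

    /**
     * Runs the given work and measures the wall-clock time of the complete call.
     * The work is assumed to stand in for the overall executor invocation,
     * including any metadata lookup done before the individual checks run.
     * The clock is injectable so the behaviour can be tested deterministically.
     */
    public static <T> Timed<T> run(Supplier<T> work, long timeoutMillis, LongSupplier nanoClock) {
        long start = nanoClock.getAsLong();
        T value = work.get();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - start);

        // Only produce a message when the overall duration exceeds the timeout.
        String warning = elapsedMillis > timeoutMillis
                ? "Overall healthcheck execution took " + elapsedMillis
                        + " ms, exceeding the configured timeout of " + timeoutMillis + " ms"
                : null;
        return new Timed<>(value, elapsedMillis, warning);
    }
}
```

In real code the clock would be System::nanoTime and a non-null warning would 
be passed to the logger at WARN level. The sketch only detects the overrun 
after the fact; it does not interrupt or shorten a slow execution.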



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
