[ 
https://issues.apache.org/jira/browse/FELIX-6663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Hoh updated FELIX-6663:
-----------------------------
    Description: 
We monitor our system using Felix Healthchecks and require that some 
healthchecks report OK at least every 5 seconds. For this we configured 
the timeout in the HealthCheckOptions to 5 seconds.

Sometimes the system goes unhealthy without a healthcheck being executed. 
It even seems that none of the required healthchecks are executed during 
that time at all.

I already ruled out a few obvious causes (full GC, maxed-out CPU), but I still 
have a few cases which I cannot explain yet. Also, while checking the code, I 
found that on every invocation of HealthcheckExecutor.execute() all 
metadata for the healthchecks is collected, which requires access to the OSGi 
service registry. My application also has situations with a lot of access to 
the service registry, which can suffer from lock contention under load, 
and the time spent there is not included in the timeout calculation of the 
healthchecks.

As a first step I would like to add some more logging in case the overall 
execution of the healthchecks exceeds the configured timeout.
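
A minimal sketch of what such logging could look like, assuming the whole 
call (metadata lookup included) is wrapped by a timing helper. The class 
TimedExecution and its methods are hypothetical names used for illustration 
only; they are not part of the Felix Health Check API, and the executor call 
is represented by a plain Supplier.

```java
import java.time.Duration;
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper, not part of Felix: times an arbitrary call and
// warns when the wall-clock duration exceeds a configured timeout.
final class TimedExecution {

    private static final Logger LOG =
            Logger.getLogger(TimedExecution.class.getName());

    private TimedExecution() {
    }

    // True if the elapsed time is strictly greater than the timeout.
    public static boolean exceeds(long elapsedNanos, Duration timeout) {
        return elapsedNanos > timeout.toNanos();
    }

    // Runs the call and measures the time around the complete invocation,
    // so any time spent before the individual checks start (for example
    // waiting on the service registry) is part of the measurement.
    public static <T> T runAndWarn(String label, Duration timeout,
                                   Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            if (exceeds(elapsed, timeout)) {
                LOG.warning(String.format(
                        "%s took %d ms, exceeding the configured timeout of %d ms",
                        label,
                        Duration.ofNanos(elapsed).toMillis(),
                        timeout.toMillis()));
            }
        }
    }
}
```

The Supplier passed to runAndWarn would wrap the HealthcheckExecutor.execute() 
invocation, so that the warning covers the overall execution rather than only 
the per-check timeout.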

  was:
We monitor our system using Felix Healthchecks and require that some 
healthchecks are reported OK at least every 5 seconds. For this we configured 
the timeout in the  HealthCheckOptions to 5 seconds.

But we face rarely the situation that the system goes unhealthy without a 
healthcheck being executed. It even seems that none of the required healthcheck 
is executed during that time at all.

I already ruled out a few obvious cases (full GC, maxed out CPU), but I still 
have a few cases which I cannot explain yet. Also while checking the code, I 
found that on every invocation of the HealthcheckExecutor.execute() all 
metadata for the healthchecks are collected, which require access to the OSGI 
Service registry. My application also has situation where a lot of access to 
the Service registry happens, which can suffer from lock contention under load, 
and that is not included into the timeout calculation of the of the 
healthchecks.

As a first step I would like to add some more logging in case the overall 
execution of the healthchecks exceed the configured timeout.


> Warn if healthcheck execution takes too long
> --------------------------------------------
>
>                 Key: FELIX-6663
>                 URL: https://issues.apache.org/jira/browse/FELIX-6663
>             Project: Felix
>          Issue Type: Task
>          Components: Health Checks
>    Affects Versions: healthcheck.core 2.2.0
>            Reporter: Joerg Hoh
>            Priority: Major
>
> We monitor our system using Felix Healthchecks and require that some 
> healthchecks report OK at least every 5 seconds. For this we configured 
> the timeout in the HealthCheckOptions to 5 seconds.
> Sometimes the system goes unhealthy without a healthcheck being executed. 
> It even seems that none of the required healthchecks are executed during 
> that time at all.
> I already ruled out a few obvious causes (full GC, maxed-out CPU), but I still 
> have a few cases which I cannot explain yet. Also, while checking the code, I 
> found that on every invocation of HealthcheckExecutor.execute() all 
> metadata for the healthchecks is collected, which requires access to the OSGi 
> service registry. My application also has situations with a lot of access to 
> the service registry, which can suffer from lock contention under load, 
> and the time spent there is not included in the timeout calculation of the 
> healthchecks.
> As a first step I would like to add some more logging in case the overall 
> execution of the healthchecks exceeds the configured timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
