[ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Egli updated SLING-5965:
-------------------------------
    Attachment: SLING-5965.v3.patch.txt
                numRunningJobs.tiff
                oldestRunningJob.tiff
                timers.tiff
                SchedulerHealthCheck.tiff

Attached [^SLING-5965.v3.patch.txt]

h4. metrics
* the following metrics exist:
** number of currently running jobs
** oldest currently running job - if a job runs longer than a threshold (1000ms by default), a temporary gauge is created for just that slow job, indicating the name of the slow job
** timers over all jobs
* all of the above is done
** grouped by thread pool name
** grouped by a configurable filter (to separate certain known slow or frequent jobs, for example)
** grouped by slow jobs (auto-detected and auto-created when hit)

h4. number of running jobs metrics example
!numRunningJobs.tiff|thumbnail!

h4. oldest running job metrics example
!oldestRunningJob.tiff|thumbnail!

h4. timers metrics example
!timers.tiff|thumbnail!

h4. Scheduler Health Check
There's a scheduler health check which does the following:
* if there are 0 running jobs, it's all green
* if there are 1 or more running jobs, it checks how old the oldest one is
* if the oldest is older than what's configured (60000ms by default), the health check becomes red and tries to extract more information as to which job is slow. It does that by listing all {{sling:commons.scheduler.oldest.running.job.millis.slow.}} gauges and showing for each how old it is (these {{slow}} gauges are auto-created when accessing any of the other {{sling:commons.scheduler.oldest.running.job.millis.}} gauges).

!SchedulerHealthCheck.tiff|thumbnail!
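To make the behaviour described above easier to review, here is a minimal sketch of the "number of running jobs", "oldest running job", slow-job detection and health-check decision in plain Java. It is not the code from the attached patch and does not use the Sling Metrics or Health Check APIs; the class {{RunningJobTracker}}, its method names and the {{Status}} enum are hypothetical, and the thresholds are simply the defaults mentioned above passed in by the caller.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; not taken from SLING-5965.v3.patch.txt. */
class RunningJobTracker {

    /** Health-check outcome: OK corresponds to "green", CRITICAL to "red". */
    enum Status { OK, CRITICAL }

    // job name -> start time in milliseconds
    private final Map<String, Long> startTimes = new ConcurrentHashMap<>();
    private final long slowThresholdMillis;     // assumed default: 1000
    private final long criticalThresholdMillis; // assumed default: 60000

    RunningJobTracker(long slowThresholdMillis, long criticalThresholdMillis) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.criticalThresholdMillis = criticalThresholdMillis;
    }

    void jobStarted(String jobName, long nowMillis) {
        startTimes.put(jobName, nowMillis);
    }

    void jobFinished(String jobName) {
        startTimes.remove(jobName);
    }

    /** Value for a "number of currently running jobs" gauge. */
    int numRunningJobs() {
        return startTimes.size();
    }

    /** Age of the oldest running job in milliseconds, or 0 if nothing runs. */
    long oldestRunningJobMillis(long nowMillis) {
        long oldest = 0;
        for (long start : startTimes.values()) {
            oldest = Math.max(oldest, nowMillis - start);
        }
        return oldest;
    }

    /** Name and age of every job running longer than the slow threshold. */
    Map<String, Long> slowJobs(long nowMillis) {
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, Long> e : startTimes.entrySet()) {
            long age = nowMillis - e.getValue();
            if (age > slowThresholdMillis) {
                result.put(e.getKey(), age);
            }
        }
        return result;
    }

    /** Green when idle or when the oldest job is within the limit, red otherwise. */
    Status check(long nowMillis) {
        if (startTimes.isEmpty()) {
            return Status.OK;
        }
        return oldestRunningJobMillis(nowMillis) > criticalThresholdMillis
                ? Status.CRITICAL
                : Status.OK;
    }
}
```

In this sketch the caller passes the current time explicitly, which keeps the logic deterministic; a real implementation would more likely read a clock and register the values as gauges.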
Reviews very welcome, /cc [~chetanm], [~cziegeler]

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.tiff, oldestRunningJob.tiff,
> SchedulerHealthCheck.tiff, SLING-5965.patch, SLING-5965.v2.patch.txt,
> SLING-5965.v3.patch.txt, timers.tiff
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast-running jobs.
> They are served from a thread-pool and should occupy a thread only for a
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they
> start to occupy threads from that thread-pool and thus affect the
> performance of other scheduled/quartz-jobs.
> We should have metrics (using
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
> that provide information about the internals of Sling Scheduler, such as the
> average and maximum duration of scheduled jobs, as well as how many jobs are
> currently running and how long the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and
> flag {{critical}} when e.g. the oldest job is older than {{60'000ms}}
> (configurable, default).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)