[ https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Egli updated SLING-5965:
-------------------------------
    Attachment: SLING-5965.v5.patch.txt

Thanks [~chetanm] for the review! I've reworked the patch based on the feedback as follows (patch attached [^SLING-5965.v5.patch.txt]):
* {{@Modified}}: removed this. It did indeed not work, and it turned out to cause more trouble than it was worth. So a reconfiguration of the QuartzScheduler now means a normal deactivation/activation cycle.
* {{createTemporaryGauge}}: nice catch. I've added a scheduled job (as part of the scheduler) which runs every 30 minutes by default and calls all the temporary gauges. This would unregister them automatically. 30 minutes seems fast enough, as it's an edge case, and slow enough to not have any performance impact.
* _factoring out of QuartzScheduler_: yes indeed. I've separated the gauge-related code from QuartzScheduler into a new {{GaugesSupport}} OSGi component which back-references the QuartzScheduler. It's the one which creates gauges and temporary gauges and owns the cleanup job for temporary gauges.
* cleaned up the new metrics code in QuartzJobScheduler to be better separated.
* factored out {{MetricsHelper}} and {{ConfigHolder}}.

Further, I've done some performance tweaking:
* avoided initialization of the filter maps for each job execution
* restricted the _filter config_ (which allows separating out heavily used jobs into their own, separately named timer) to plain class names (or containing classes for nested classes, i.e. without the {{$}} part). The matching is now done based only on equality of these (containing) class names. This still allows configuring a filter for the ChangeProcessors like this: {{obsstats=org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor}} (while the actual class is an anonymous one), while at the same time using O(1) for the lookup. This is important as the filter lookup is executed for each job.
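
For illustration, the equality-based filter lookup described above could be sketched roughly as follows. This is not the code from the attached patch: the class {{FilterLookupSketch}} and its method names are hypothetical, and the parsing of the config entries is an assumption based on the {{obsstats=...}} example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and parsing rules are assumptions, not the patch's code.
final class FilterLookupSketch {

    // Maps a (containing) class name to the filter name, i.e. the separately named timer.
    private final Map<String, String> filterByClassName = new HashMap<>();

    // Parses entries of the form "<filterName>=<containingClassName>" once, up front,
    // so that nothing has to be initialized per job execution.
    FilterLookupSketch(String... filterDefinitions) {
        for (String definition : filterDefinitions) {
            int separator = definition.indexOf('=');
            if (separator <= 0 || separator == definition.length() - 1) {
                continue; // skip malformed entries
            }
            String filterName = definition.substring(0, separator).trim();
            String className = definition.substring(separator + 1).trim();
            filterByClassName.put(className, filterName);
        }
    }

    // Drops everything from the first '$' on, so nested and anonymous classes
    // resolve to their containing class.
    static String containingClassName(String className) {
        int dollar = className.indexOf('$');
        return dollar < 0 ? className : className.substring(0, dollar);
    }

    // A single hash lookup per job; returns null if no filter is configured.
    String filterNameFor(String jobClassName) {
        return filterByClassName.get(containingClassName(jobClassName));
    }
}
```

With a filter configured as {{obsstats=org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor}}, a job whose class name is {{org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor$1}} would resolve to {{obsstats}} in this sketch, while a class without a configured filter yields {{null}}.
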
In the previous patch the lookup was a loop through the filters, hence O(n), because it supported {{startsWith}} or {{contains}}. Now it's simply {{equals}} on the containing class name.

Should look nicer now.

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.jpg, oldestRunningJob.jpg, SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt, SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt, SLING-5965.v5.patch.txt, timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast-running jobs. They are served from a thread-pool and should occupy a thread only for a short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they start to occupy threads from that thread-pool and thus influence the performance of other scheduled/quartz-jobs.
> We should have metrics (using [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html]) that provide information about the internals of the Sling Scheduler, such as the average, max etc. duration of scheduled jobs, as well as how many jobs are currently running and since when the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and flag {{critical}} when e.g. the oldest job is older than {{60'000ms}} (configurable, default).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)