[
https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Egli updated SLING-5965:
-------------------------------
Attachment: SLING-5965.v5.patch.txt
Thanks [~chetanm] for the review! I've reworked the patch based on the feedback
as follows (patch attached [^SLING-5965.v5.patch.txt]):
* {{@Modified}} : removed. It indeed did not work, and it turned out to cause
more trouble than it was worth. A reconfiguration of the QuartzScheduler now
means a normal deactivation/activation cycle.
* {{createTemporaryGauge}} : nice catch. I've added a scheduled job (as part of
the scheduler itself) which runs every 30 minutes by default and calls all the
temporary gauges, which causes them to unregister automatically. 30 minutes
seems fast enough, as this is an edge case, and slow enough to not have any
performance impact.
* _factoring out of QuartzScheduler_ : yes indeed. I've separated the
gauge-related code from QuartzScheduler into a new {{GaugesSupport}} OSGi
component which references back to the QuartzScheduler. It's the one which
creates gauges and temporary gauges, and owns the cleanup job for temporary
gauges.
* cleaned up the new metrics code in QuartzJobScheduler to be better separated.
* factored out {{MetricsHelper}} and {{ConfigHolder}}
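
To illustrate the cleanup job for temporary gauges described above, here is a
minimal sketch in plain Java. It is not the code from the patch: the class
{{TemporaryGaugeSketch}}, its methods and the {{BooleanSupplier}} callback are
hypothetical names, and it assumes that a temporary gauge unregisters itself
when it is read after its job has finished. The real {{GaugesSupport}} uses the
Sling scheduler and the metrics registry rather than a plain executor and map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the gauge handling; all names are assumptions.
class TemporaryGaugeSketch {

    /** A gauge that removes itself from the registry once its job is done. */
    final class TemporaryGauge {
        private final String name;
        private final long startMillis;
        private final BooleanSupplier jobStillRunning;

        TemporaryGauge(String name, long startMillis, BooleanSupplier jobStillRunning) {
            this.name = name;
            this.startMillis = startMillis;
            this.jobStillRunning = jobStillRunning;
        }

        /** Returns the running time in ms, or -1 after unregistering itself. */
        long getValue() {
            if (!jobStillRunning.getAsBoolean()) {
                gauges.remove(name, this); // self-unregistration on read
                return -1;
            }
            return System.currentTimeMillis() - startMillis;
        }
    }

    // Stand-in for the metrics registry.
    private final Map<String, TemporaryGauge> gauges = new ConcurrentHashMap<>();

    TemporaryGauge createTemporaryGauge(String name, BooleanSupplier jobStillRunning) {
        TemporaryGauge gauge =
                new TemporaryGauge(name, System.currentTimeMillis(), jobStillRunning);
        gauges.put(name, gauge);
        return gauge;
    }

    /** Cleanup pass: reading every gauge lets finished ones unregister. */
    void pollAll() {
        for (TemporaryGauge gauge : gauges.values()) {
            gauge.getValue();
        }
    }

    int registeredCount() {
        return gauges.size();
    }

    /** Runs the cleanup pass periodically, e.g. every 30 minutes. */
    ScheduledExecutorService startCleanup(long periodMinutes) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(this::pollAll, periodMinutes, periodMinutes,
                TimeUnit.MINUTES);
        return executor;
    }
}
```

Without the periodic pass, a gauge in this sketch would only disappear if
something else happened to read it, which is why the cleanup job matters even
though it only covers an edge case.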
Further, I've done some performance tweaking:
* avoided initialization of the filter maps for each job execution
* restricted the _filter config_ (which allows separating out heavily used jobs
into their own, separately named timer) to plain class names (or, for nested
classes, the containing class name, i.e. without the {{$}} part). Matching is
now based purely on equality of these (containing) class names. This still
allows e.g. configuring a filter for the ChangeProcessors like this:
{{obsstats=org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor}} (even
though the actual class is an anonymous one), while the lookup is O(1). This is
important as the filter lookup is executed for each job. In the previous patch
this was a loop through the filters, hence O(n), in order to support
{{startsWith}} or {{contains}}. Now it's simply {{equals}} on the containing
class name.
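
The equality-based lookup described above could be sketched roughly as
follows. This is an illustration only, not the code from the patch: the class
{{FilterLookupSketch}} and its methods are hypothetical names, and the only
input format assumed is the {{name=className}} form shown in the example
above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the O(1) filter lookup; names are assumptions.
class FilterLookupSketch {

    // Maps a (containing) class name to the configured filter/timer name.
    private final Map<String, String> filterByClassName = new HashMap<>();

    /** Parses entries of the form "filterName=fully.qualified.ClassName". */
    FilterLookupSketch(String... filterConfig) {
        for (String entry : filterConfig) {
            int eq = entry.indexOf('=');
            if (eq <= 0 || eq == entry.length() - 1) {
                continue; // skip malformed entries
            }
            String filterName = entry.substring(0, eq).trim();
            String className = entry.substring(eq + 1).trim();
            filterByClassName.put(className, filterName);
        }
    }

    /** Strips the nested/anonymous part, i.e. everything from the first '$'. */
    static String containingClassName(String className) {
        int dollar = className.indexOf('$');
        return dollar < 0 ? className : className.substring(0, dollar);
    }

    /** Single hash lookup by class name; returns null if no filter matches. */
    String filterNameFor(String jobClassName) {
        return filterByClassName.get(containingClassName(jobClassName));
    }

    /** Convenience overload for an actual job instance. */
    String filterNameFor(Object job) {
        return filterNameFor(job.getClass().getName());
    }
}
```

With the filter above configured, a job whose class name is
{{org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor$1}} resolves to
{{obsstats}} with one map lookup, independent of the number of configured
filters.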
It should look nicer now.
> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
> Key: SLING-5965
> URL: https://issues.apache.org/jira/browse/SLING-5965
> Project: Sling
> Issue Type: New Feature
> Components: Commons
> Affects Versions: Commons Scheduler 2.5.0
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Fix For: Commons Scheduler 2.6.4
>
> Attachments: numRunningJobs.jpg, oldestRunningJob.jpg,
> SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt,
> SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt, SLING-5965.v5.patch.txt,
> timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs.
> They are served from a thread-pool and should occupy that thread only for a
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they
> start to occupy threads from that thread-pool, thus have an influence on the
> performance of other scheduled/quartz-jobs.
> We should have metrics (using
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
> that provide information about internals of Sling Scheduler, such as average,
> max etc duration of scheduled jobs, as well as how many jobs are currently
> running and since when the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and
> flag {{critical}} when eg the oldest job is older than {{60'000ms}}
> (configurable, default).