[ 
https://issues.apache.org/jira/browse/SLING-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Egli updated SLING-5965:
-------------------------------
    Attachment: SLING-5965.v5.patch.txt

Thanks [~chetanm] for the review! I've reworked the patch based on the feedback 
as follows (patch attached [^SLING-5965.v5.patch.txt]):
* {{@Modified}} : removed. It did indeed not work, and it turned out to cause 
more trouble than it was worth. So now a reconfiguration of the QuartzScheduler 
means a normal deactivation/activation cycle.
* {{createTemporaryGauge}} : nice catch. I've added a scheduled job (run by the 
scheduler itself) which runs every 30 minutes by default and calls all the 
temporary gauges. This unregisters them automatically. 30 minutes seems 
fast enough, as it's an edge case, and slow enough to not have any 
performance impact.
* _factoring out of QuartzScheduler_ : yes indeed. I've separated the 
gauge-related code from QuartzScheduler into a new {{GaugesSupport}} OSGi 
component, which references back to the QuartzScheduler. It's the one which 
creates gauges and temporary gauges, and owns the cleanup job for temporary 
gauges.
* cleaned up the new metrics code in QuartzJobScheduler to be better separated.
* factored out {{MetricsHelper}} and {{ConfigHolder}}
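
To illustrate the temporary gauge cleanup described above, here is a minimal sketch in plain Java. It is not code from the patch: the {{REGISTRY}} map, the {{TemporaryGauge}} class and the "job is still running" check are hypothetical stand-ins, and the condition under which a gauge unregisters itself is an assumption. It only shows the idea that a periodic job reads every temporary gauge and that reading a stale gauge removes it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

public class TemporaryGaugeSketch {

    // Hypothetical stand-in for registering gauges with the metrics service.
    static final Map<String, TemporaryGauge> REGISTRY = new ConcurrentHashMap<>();

    static final class TemporaryGauge {
        private final String name;
        private final long startMillis = System.currentTimeMillis();
        private final BooleanSupplier jobStillRunning;

        TemporaryGauge(String name, BooleanSupplier jobStillRunning) {
            this.name = name;
            this.jobStillRunning = jobStillRunning;
        }

        // Reading the gauge triggers self-unregistration once the job is done
        // (assumed condition; the patch may use a different one).
        long getValue() {
            if (!jobStillRunning.getAsBoolean()) {
                REGISTRY.remove(name);
                return -1L; // assumed sentinel for "no longer running"
            }
            return System.currentTimeMillis() - startMillis;
        }
    }

    static TemporaryGauge createTemporaryGauge(String name, BooleanSupplier jobStillRunning) {
        TemporaryGauge gauge = new TemporaryGauge(name, jobStillRunning);
        REGISTRY.put(name, gauge);
        return gauge;
    }

    // Body of the periodic cleanup job: simply read every temporary gauge.
    // ConcurrentHashMap iteration tolerates removals during the loop.
    static void cleanupTemporaryGauges() {
        for (TemporaryGauge gauge : REGISTRY.values()) {
            gauge.getValue();
        }
    }
}
```

In this sketch {{cleanupTemporaryGauges()}} would be what the 30-minute job executes. Gauges that are read regularly by a metrics consumer clean themselves up anyway, so the job only matters for the edge case where nobody reads them.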

Further, I've done some performance tweaking:
* avoided initialization of the filter maps for each job execution
* restricted the _filter config_ (which allows separating out heavily used jobs 
into their own, separately named timer) to plain class names (or the containing 
class for nested classes, i.e. without the {{$}} part). Matching is now done 
purely on equality of these (containing) class names. This still allows, e.g., 
configuring a filter for the ChangeProcessors like this: 
{{obsstats=org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor}} (while 
the actual class is an anonymous one), while keeping the lookup at O(1). This 
is important as the filter lookup is executed for each job. In the previous 
patch this was a loop through the filters, hence O(n), because it supported 
{{startsWith}} or {{contains}}. So now it's simply {{equals}} on the 
containing class name.
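
The lookup described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: the class and method names ({{FilterLookupSketch}}, {{containingClassName}}, {{parseFilters}}, {{filterNameFor}}) are made up, and only the {{name=className}} entry format and the "strip everything from the first {{$}}" rule are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

public class FilterLookupSketch {

    // "a.b.Outer$1" or "a.b.Outer$Inner" -> "a.b.Outer"; top-level names are returned unchanged.
    static String containingClassName(String className) {
        int idx = className.indexOf('$');
        return idx < 0 ? className : className.substring(0, idx);
    }

    // Parses entries of the form "filterName=fully.qualified.ClassName"
    // into a map keyed by class name, so the per-job lookup is a single hash lookup.
    static Map<String, String> parseFilters(String... entries) {
        Map<String, String> filterNameByClass = new HashMap<>();
        for (String entry : entries) {
            int eq = entry.indexOf('=');
            if (eq <= 0 || eq == entry.length() - 1) {
                continue; // skip malformed entries (assumed behaviour)
            }
            String filterName = entry.substring(0, eq).trim();
            String className = entry.substring(eq + 1).trim();
            filterNameByClass.put(className, filterName);
        }
        return filterNameByClass;
    }

    // Executed for each job: O(1) on average, no loop over the configured filters.
    // Returns null when no filter is configured for the job's containing class.
    static String filterNameFor(Map<String, String> filters, String jobClassName) {
        return filters.get(containingClassName(jobClassName));
    }

    public static void main(String[] args) {
        Map<String, String> filters = parseFilters(
                "obsstats=org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor");
        // An anonymous class inside ChangeProcessor still matches the filter.
        System.out.println(filterNameFor(filters,
                "org.apache.jackrabbit.oak.jcr.observation.ChangeProcessor$1"));
    }
}
```

The trade-off is that prefix or substring patterns can no longer be configured, in exchange for a constant-time lookup on the job execution path.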

should look nicer now.

> Metrics and a Health-Check for Scheduler to detect long-running Quartz-Jobs
> ---------------------------------------------------------------------------
>
>                 Key: SLING-5965
>                 URL: https://issues.apache.org/jira/browse/SLING-5965
>             Project: Sling
>          Issue Type: New Feature
>          Components: Commons
>    Affects Versions: Commons Scheduler 2.5.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Commons Scheduler 2.6.4
>
>         Attachments: numRunningJobs.jpg, oldestRunningJob.jpg, 
> SchedulerHealthCheck.jpg, SLING-5965.patch, SLING-5965.v2.patch.txt, 
> SLING-5965.v3.patch.txt, SLING-5965.v4.patch.txt, SLING-5965.v5.patch.txt, 
> timers.jpg
>
>
> Sling Scheduler jobs (aka Quartz-Jobs) should typically be fast running jobs. 
> They are served from a thread-pool and should occupy that thread only for a 
> short amount of time.
> If there are 'misbehaving' quartz-jobs that run for a very long time, they 
> start to occupy threads from that thread-pool, thus have an influence on the 
> performance of other scheduled/quartz-jobs.
> We should have metrics (using 
> [sling.commons.metrics|https://sling.apache.org/documentation/bundles/metrics.html])
>  that provide information about internals of Sling Scheduler, such as average, 
> max etc duration of scheduled jobs, as well as how many jobs are currently 
> running and since when the oldest job has been running.
> Based on this, a Health-Check can monitor the 'oldest job running' metric and 
> flag {{critical}} when eg the oldest job is older than {{60'000ms}} 
> (configurable, default).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
