Giuseppe Lavagetto has uploaded a new change for review. (
https://gerrit.wikimedia.org/r/357366 )
Change subject: role::mediawiki::scaler: use more sensible intervals for checks
......................................................................
role::mediawiki::scaler: use more sensible intervals for checks
To reduce the number of false positives, it is better to perform more
checks before alerting than to rely on a huge retry interval.
So do the following:
- raise the check interval to 5 minutes; we really don't need more
granularity than that on this alarm
- set the retry interval to 5 minutes as well
- retry the check 10 times before we actually raise an alert in hard
state
This should help reduce the number of false positives and spare us
head-scratching moments where we check a machine that recovered
20 minutes ago.
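To make the trade-off concrete, here is a minimal sketch of the timing arithmetic. It assumes standard Nagios/Icinga semantics, and that the `retries` parameter maps to `max_check_attempts`: the first failure counts as attempt 1, and each further attempt runs `retry_interval` minutes after the previous one. The helper name `minutes_to_hard_state` is hypothetical and not part of the Puppet module.

```python
def minutes_to_hard_state(retry_interval: int, retries: int) -> int:
    """Minutes between the first failed check and the hard state.

    Assumption: `retries` is the total number of attempts, so only
    (retries - 1) of them are spaced by retry_interval.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    return (retries - 1) * retry_interval


# Old settings: 3 attempts, 30 minutes apart.
old = minutes_to_hard_state(retry_interval=30, retries=3)
# New settings: 10 attempts, 5 minutes apart.
new = minutes_to_hard_state(retry_interval=5, retries=10)

print(f"old: hard state after {old} min, based on 3 samples")
print(f"new: hard state after {new} min, based on 10 samples")
```

Under these assumptions the new settings sample the service more than three times as often before alerting, and a recovery is noticed within one 5-minute retry rather than after a wait of up to 30 minutes.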
Change-Id: Ib5b86e0b5a8ebadbb2f9fac7b87a2289af981524
---
M modules/role/manifests/mediawiki/scaler.pp
1 file changed, 4 insertions(+), 3 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/66/357366/1
diff --git a/modules/role/manifests/mediawiki/scaler.pp b/modules/role/manifests/mediawiki/scaler.pp
index 429e70e..b08fe72 100644
--- a/modules/role/manifests/mediawiki/scaler.pp
+++ b/modules/role/manifests/mediawiki/scaler.pp
@@ -23,15 +23,16 @@
# a graceful restart happens, even if outstanding requests are not dropped
# or marked as Graceful closing.
# This means that daily tasks like logrotate cause false positives.
- # A quick workaround is to use a high enough retry interval monitoring check,
+ # A quick workaround is to use a high enough number of retries monitoring check,
# to give httpd time to restore its busy workers.
# This is not an ideal solution but a constant rate of false positives
# decreases the perceived importance of the alarm over time.
nrpe::monitor_service { 'check_leaked_hhvm_threads':
description => 'Check HHVM threads for leakage',
nrpe_command => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
- retry_interval => 30,
- retries => 3,
+ check_interval => 5,
+ retry_interval => 5,
+ retries => 10,
require => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
}
}
--
To view, visit https://gerrit.wikimedia.org/r/357366
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib5b86e0b5a8ebadbb2f9fac7b87a2289af981524
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Giuseppe Lavagetto <[email protected]>