Giuseppe Lavagetto has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/357366 )

Change subject: role::mediawiki::scaler: use more sensible intervals for checks
......................................................................

role::mediawiki::scaler: use more sensible intervals for checks

To reduce the number of false positives, it is better to perform more
checks before alarming than to rely on a huge retry interval.

So do the following:
- raise the check interval to 5 minutes; we really don't need more
  granularity than that on this alarm
- set the retry interval to 5 minutes too
- run the check 10 times before we actually raise an alert in
  hard state

This should help reduce the number of false positives and spare us
head-scratching moments where we check a machine that recovered
20 minutes ago.
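
As a rough illustration of the trade-off, the sketch below compares how
long a failing check stays in soft state under the old and new settings.
It assumes Nagios/Icinga-style semantics, namely that `retries` maps to
the maximum number of check attempts, that the first failed check counts
as attempt 1, and that intervals are in minutes. The helper name
minutes_to_hard_state is hypothetical and not part of the puppet module.

```python
def minutes_to_hard_state(retry_interval_min: int, max_attempts: int) -> int:
    """Minutes between the first failed check and the hard state.

    Assumption: the first failure is attempt 1, and every further
    attempt runs retry_interval_min minutes after the previous one.
    """
    return (max_attempts - 1) * retry_interval_min


# Old settings from the diff: retry_interval => 30, retries => 3
old = minutes_to_hard_state(retry_interval_min=30, max_attempts=3)

# New settings from the diff: retry_interval => 5, retries => 10
new = minutes_to_hard_state(retry_interval_min=5, max_attempts=10)

print(f"old: {old} min over 3 samples, new: {new} min over 10 samples")
```

Under these assumptions the time to a hard alert is of the same order,
but the decision rests on 10 samples taken 5 minutes apart rather than 3
samples taken 30 minutes apart, so a recovery is noticed much sooner.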

Change-Id: Ib5b86e0b5a8ebadbb2f9fac7b87a2289af981524
---
M modules/role/manifests/mediawiki/scaler.pp
1 file changed, 4 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/66/357366/1

diff --git a/modules/role/manifests/mediawiki/scaler.pp b/modules/role/manifests/mediawiki/scaler.pp
index 429e70e..b08fe72 100644
--- a/modules/role/manifests/mediawiki/scaler.pp
+++ b/modules/role/manifests/mediawiki/scaler.pp
@@ -23,15 +23,16 @@
     # a graceful restart happens, even if outstanding requests are not dropped
     # or marked as Graceful closing.
     # This means that daily tasks like logrotate cause false positives.
-    # A quick workaround is to use a high enough retry interval monitoring check,
+    # A quick workaround is to use a high enough number of retries monitoring check,
     # to give httpd time to restore its busy workers.
     # This is not an ideal solution but a constant rate of false positives
     # decreases the perceived importance of the alarm over time.
     nrpe::monitor_service { 'check_leaked_hhvm_threads':
         description    => 'Check HHVM threads for leakage',
         nrpe_command   => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
-        retry_interval => 30,
-        retries        => 3,
+        check_interval => 5,
+        retry_interval => 5,
+        retries        => 10,
         require        => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
     }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/357366
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib5b86e0b5a8ebadbb2f9fac7b87a2289af981524
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Giuseppe Lavagetto <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits