Elukey has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/336212 )

Change subject: Raise the retry interval of the Apache/HHVM leak monitor
......................................................................

Raise the retry interval of the Apache/HHVM leak monitor

The first attempt was https://gerrit.wikimedia.org/r/#/c/334505,
which modified the check_leaked_hhvm_threads.py script to skip the
check when httpd's uptime was less than two hours.
That approach failed because httpd resets its uptime counter only
after a full restart, not after a graceful reload. The alternative
is to move the delay into the monitor: a longer retry interval
allows more time after the first failure detection, which avoids
false positives.
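
To make the effect of the delay concrete, here is a rough sketch of the
arithmetic, assuming standard Nagios/Icinga semantics: the first failure
counts as attempt 1, and each further attempt runs retry_interval later
until max_check_attempts is reached. The function name is hypothetical,
the unit is assumed to be minutes, and max_check_attempts=3 is an assumed
value that is not taken from this change.

```python
def minutes_until_hard_state(retry_interval, max_check_attempts):
    """Minutes between the first failed check and the hard (alerting) state.

    Hypothetical helper: assumes retry_interval is expressed in minutes.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    # The first failure is attempt 1; every remaining attempt is scheduled
    # retry_interval minutes after the previous one.
    return retry_interval * (max_check_attempts - 1)


# retry_interval=30 comes from this change; max_check_attempts=3 is assumed.
print(minutes_until_hard_state(30, 3))
```

Under those assumptions, httpd would get about an hour to restore its busy
workers before the alarm fires.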

Change-Id: I76e9ef08c440378a13da96f54a22b8e00378161f
---
M modules/role/files/mediawiki/check_leaked_hhvm_threads.py
M modules/role/manifests/mediawiki/scaler.pp
2 files changed, 15 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/12/336212/1

diff --git a/modules/role/files/mediawiki/check_leaked_hhvm_threads.py b/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
index 9c338fe..8f18ef6 100755
--- a/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
+++ b/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
@@ -37,19 +37,13 @@
 PERC_CRITICAL = 2.0
 
 # Perform checks only if the uptime is more than this threshold (seconds).
-UPTIME_THRESHOLD = 7200
+# This option leaves a bit of ramp up time to Apache/HHVM to reach a steady
+# running state.
+UPTIME_THRESHOLD = 1800
 
 # I know there is a race condition here. But we can live with that.
 try:
     apache_status = requests.get('http://127.0.0.1/server-status?auto')
-    # For some versions of httpd (like 2.4.7), BusyWorkers are set to zero when
-    # a graceful restart happens, even if outstanding requests are not dropped
-    # or marked as Graceful closing.
-    # This means that daily tasks like logrotate cause false positives.
-    # A quick workaround is to limit the check only when the Uptime is more
-    # than a couple of hours, to give httpd time to restore its busy workers.
-    # This is not an ideal solution but a constant rate of false positives
-    # decreases the perceived importance of the alarm over time.
     match = re.search('Uptime: (\d+)', apache_status.text)
     if not match:
         print('UNKNOWN - Could not find apache uptime in apache status')
diff --git a/modules/role/manifests/mediawiki/scaler.pp b/modules/role/manifests/mediawiki/scaler.pp
index 7175269..5fda286 100644
--- a/modules/role/manifests/mediawiki/scaler.pp
+++ b/modules/role/manifests/mediawiki/scaler.pp
@@ -19,9 +19,18 @@
         group  => 'root',
     }
 
+    # For some versions of httpd (like 2.4.7), BusyWorkers are set to zero when
+    # a graceful restart happens, even if outstanding requests are not dropped
+    # or marked as Graceful closing.
+    # This means that daily tasks like logrotate cause false positives.
+    # A quick workaround is to use a high enough retry interval monitoring check,
+    # to give httpd time to restore its busy workers.
+    # This is not an ideal solution but a constant rate of false positives
+    # decreases the perceived importance of the alarm over time.
     nrpe::monitor_service { 'check_leaked_hhvm_threads':
-        description  => 'Check HHVM threads for leakage',
-        nrpe_command => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
-        require      => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
+        description    => 'Check HHVM threads for leakage',
+        nrpe_command   => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
+        retry_interval => 30,
+        require        => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
     }
 }
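
The uptime gate kept in the check script can be illustrated on its own.
This is only a sketch under stated assumptions, not the script itself:
the helper names parse_uptime and should_run_check are hypothetical, and
the sample status text is made up. Only the 'Uptime: (\d+)' pattern and
the 1800 second threshold come from the diff above.

```python
import re

# Seconds; the value this change sets in check_leaked_hhvm_threads.py.
UPTIME_THRESHOLD = 1800


def parse_uptime(status_text):
    """Return the uptime in seconds from 'server-status?auto' text, or None."""
    match = re.search(r'Uptime: (\d+)', status_text)
    return int(match.group(1)) if match else None


def should_run_check(status_text, threshold=UPTIME_THRESHOLD):
    """True if the leak check should run, False if httpd is still ramping up.

    Returns None when no uptime is found, which the real script reports
    as UNKNOWN.
    """
    uptime = parse_uptime(status_text)
    if uptime is None:
        return None
    # Checks are performed only if the uptime is more than the threshold.
    return uptime > threshold


# Made-up sample of the machine-readable status output.
sample = "Uptime: 1200\nBusyWorkers: 3\nIdleWorkers: 47\n"
print(should_run_check(sample))
```

Because a graceful reload does not reset the uptime counter, this gate only
covers ramp-up after a full restart; the retry interval in scaler.pp is what
absorbs the reload case.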

-- 
To view, visit https://gerrit.wikimedia.org/r/336212
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I76e9ef08c440378a13da96f54a22b8e00378161f
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Elukey <ltosc...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
