Elukey has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/336212 )

Change subject: Raise the retry interval of the Apache/HHVM leak monitor
......................................................................

Raise the retry interval of the Apache/HHVM leak monitor

The first attempt was https://gerrit.wikimedia.org/r/#/c/334505, in which
the check_leaked_hhvm_threads.py script was modified to avoid any check
if the httpd's uptime was less than two hours. The solution turned out
to be a failure since httpd does not reset its uptime counter when a
graceful reload happens (only after a restart).

The alternative solution is to transfer the delay to the monitor,
allowing more time after the first failure detection to avoid
false positives.

Change-Id: I76e9ef08c440378a13da96f54a22b8e00378161f
---
M modules/role/files/mediawiki/check_leaked_hhvm_threads.py
M modules/role/manifests/mediawiki/scaler.pp
2 files changed, 15 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/12/336212/1

diff --git a/modules/role/files/mediawiki/check_leaked_hhvm_threads.py b/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
index 9c338fe..8f18ef6 100755
--- a/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
+++ b/modules/role/files/mediawiki/check_leaked_hhvm_threads.py
@@ -37,19 +37,13 @@
 PERC_CRITICAL = 2.0
 
 # Perform checks only if the uptime is more than this threshold (seconds).
-UPTIME_THRESHOLD = 7200
+# This option leaves a bit of ramp up time to Apache/HHVM to reach a steady
+# running state.
+UPTIME_THRESHOLD = 1800
 
 # I know there is a race condition here. But we can live with that.
 try:
     apache_status = requests.get('http://127.0.0.1/server-status?auto')
-    # For some versions of httpd (like 2.4.7), BusyWorkers are set to zero when
-    # a graceful restart happens, even if outstanding requests are not dropped
-    # or marked as Graceful closing.
-    # This means that daily tasks like logrotate cause false positives.
-    # A quick workaround is to limit the check only when the Uptime is more
-    # than a couple of hours, to give httpd time to restore its busy workers.
-    # This is not an ideal solution but a constant rate of false positives
-    # decreases the perceived importance of the alarm over time.
     match = re.search('Uptime: (\d+)', apache_status.text)
     if not match:
         print('UNKNOWN - Could not find apache uptime in apache status')
diff --git a/modules/role/manifests/mediawiki/scaler.pp b/modules/role/manifests/mediawiki/scaler.pp
index 7175269..5fda286 100644
--- a/modules/role/manifests/mediawiki/scaler.pp
+++ b/modules/role/manifests/mediawiki/scaler.pp
@@ -19,9 +19,18 @@
         group => 'root',
     }
 
+    # For some versions of httpd (like 2.4.7), BusyWorkers are set to zero when
+    # a graceful restart happens, even if outstanding requests are not dropped
+    # or marked as Graceful closing.
+    # This means that daily tasks like logrotate cause false positives.
+    # A quick workaround is to use a high enough retry interval monitoring check,
+    # to give httpd time to restore its busy workers.
+    # This is not an ideal solution but a constant rate of false positives
+    # decreases the perceived importance of the alarm over time.
     nrpe::monitor_service { 'check_leaked_hhvm_threads':
-        description  => 'Check HHVM threads for leakage',
-        nrpe_command => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
-        require      => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
+        description    => 'Check HHVM threads for leakage',
+        nrpe_command   => '/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads',
+        retry_interval => 30,
+        require        => File['/usr/local/lib/nagios/plugins/check_leaked_hhvm_threads'],
     }
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/336212
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I76e9ef08c440378a13da96f54a22b8e00378161f
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Elukey <ltosc...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits