Faidon Liambotis has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/237998

Change subject: mediawiki: kill HHVM graphite checks
......................................................................

mediawiki: kill HHVM graphite checks

Kill HHVM "queue size" and "busy threads" Graphite alerts:

1) They are unreliable because Graphite is often missing data. I check
Icinga multiple times a day and there are always 10-30 of these alerts
stuck at UNKNOWN for lack of data.

2) During an outage, like the one that happened a few minutes ago,
these fire like crazy, spamming the channel. What's worse, because they
rely on percentages of historic behavior (i.e. past data), they fire
long after the appserver has started misbehaving and recover long after
the appserver has actually recovered. Sometimes they even fire after
the recovery.

This is misleading during an outage, which is the worst thing a check
can be.

3) I've never seen those checks fire for legitimate reasons without our
other checks (e.g. check_http) having detected the same failure. The
cost of having ~800 checks is non-negligible, and if the benefits are
dubious, we should not have them.
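
To illustrate the lag described in point 2, here is a minimal sketch in
Python. It assumes the check alerts once at least `percentage` percent
of the datapoints in a trailing window exceed the threshold. The window
length, the sample values, and the function names are illustrative
assumptions and are not taken from the actual check_graphite code.

```python
def percentage_alert(window, threshold, percentage):
    """True when at least `percentage` % of the samples exceed `threshold`."""
    over = sum(1 for value in window if value > threshold)
    return over * 100 >= percentage * len(window)


def simulate(series, window_size, threshold, percentage):
    """Evaluate the alert at every sample, using a trailing window."""
    states = []
    for i in range(len(series)):
        window = series[max(0, i - window_size + 1):i + 1]
        states.append(percentage_alert(window, threshold, percentage))
    return states


# Hypothetical queue sizes: healthy for samples 0-9, unhealthy for
# samples 10-14, healthy again from sample 15 onwards.
series = [0] * 10 + [100] * 5 + [0] * 15
states = simulate(series, window_size=10, threshold=80, percentage=30)

first_alert = states.index(True)
last_alert = len(states) - 1 - states[::-1].index(True)
print("alert fires at sample", first_alert)
print("alert still firing at sample", last_alert)
```

Under these assumptions the outage starts at sample 10 but the alert
only fires at sample 12, and it keeps firing until sample 21 even though
the host recovered at sample 15.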

Remove them for now. If anyone feels strongly about re-adding them, I'd
suggest converting them to NRPE checks that check the *current* state
of an appserver, with somewhat higher momentary thresholds.
Alternatively, one could have a single check_graphite check across all
appservers that alerts when there are too many busy threads/long queues,
possibly as an indication of a non-obvious complex outage.
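
As a rough sketch of the first suggestion, a check on the *current*
state could look like the Python below. It follows the standard Nagios
plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Everything
else is an assumption made for illustration: the function names, the
argument order, and the thresholds are hypothetical, and how the current
value is read from the appserver is deliberately left out.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warning, critical):
    """Map a single current reading to a Nagios state and message."""
    if value is None:
        return UNKNOWN, "UNKNOWN: no current value available"
    if value >= critical:
        return CRITICAL, "CRITICAL: current value %s >= %s" % (value, critical)
    if value >= warning:
        return WARNING, "WARNING: current value %s >= %s" % (value, warning)
    return OK, "OK: current value %s" % value


def main(argv):
    """Hypothetical CLI: <current> <warning> <critical>; returns the exit code."""
    try:
        value, warning, critical = (float(arg) for arg in argv)
    except ValueError:
        # Wrong argument count or a non-numeric argument.
        print("UNKNOWN: usage: <current> <warning> <critical>")
        return UNKNOWN
    code, message = evaluate(value, warning, critical)
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main(sys.argv[1:]))`. Because
only the latest reading is inspected, such a check would change state as
soon as the value crosses a threshold, rather than waiting for a window
of past datapoints to fill up or drain.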

Change-Id: Icae30008116f6656eed8e7f1d05214d1d11fcb98
---
M modules/mediawiki/manifests/monitoring/webserver.pp
1 file changed, 0 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/98/237998/1

diff --git a/modules/mediawiki/manifests/monitoring/webserver.pp b/modules/mediawiki/manifests/monitoring/webserver.pp
index 21ddfbc..a18cd10 100644
--- a/modules/mediawiki/manifests/monitoring/webserver.pp
+++ b/modules/mediawiki/manifests/monitoring/webserver.pp
@@ -14,25 +14,6 @@
             source   => 'puppet:///modules/mediawiki/monitoring/collectors/hhvm.py',
             require  => Apache::Site['hhvm_admin'],
         }
-
-        monitoring::graphite_threshold { 'hhvm_queue_size':
-            description     => 'HHVM queue size',
-            metric          => "servers.${::hostname}.hhvmHealthCollector.queued",
-            warning         => 10,
-            critical        => 80,
-            percentage      => 30,
-            nagios_critical => false
-        }
-
-        monitoring::graphite_threshold { 'hhvm_load':
-            description     => 'HHVM busy threads',
-            metric          => "servers.${::hostname}.hhvmHealthCollector.load",
-            warning         => $::mediawiki::hhvm::max_threads*0.6,
-            critical        => $::mediawiki::hhvm::max_threads * 0.9,
-            percentage      => 30,
-            nagios_critical => false
-        }
-
     }
 
     file { '/var/www/monitoring':

-- 
To view, visit https://gerrit.wikimedia.org/r/237998
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icae30008116f6656eed8e7f1d05214d1d11fcb98
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Faidon Liambotis <fai...@wikimedia.org>
