Faidon Liambotis has uploaded a new change for review. https://gerrit.wikimedia.org/r/237998
Change subject: mediawiki: kill HHVM graphite checks
......................................................................

mediawiki: kill HHVM graphite checks

Kill the HHVM "queue size" and "busy threads" Graphite alerts:

1) They are unreliable because Graphite is often missing data. I
   monitor Icinga multiple times a day and there are always 10-30 of
   those sitting at UNKNOWN for not having enough data.

2) In times of an outage, like the one that happened a few minutes
   ago, these fire like crazy, spamming the channel. What's worse,
   because they rely on percentages of historic behavior (i.e. past
   data), they fire long after the appserver has started misbehaving
   and recover long after the appserver has actually recovered.
   Sometimes they even fire after the recovery. This is misleading in
   times of an outage, which is the worst thing a check can be.

3) I've never seen those checks fire for legitimate reasons without
   our other checks (e.g. check_http) also detecting the same failure.

The cost of having ~800 checks is non-negligible, and if the benefits
are dubious, we should not have them. Remove them for now.

If anyone feels strongly about re-adding them, I'd suggest converting
them to NRPE checks that check the *current* state of an appserver
with some higher momentary thresholds. Alternatively, one could have a
check_graphite check across all appservers that alerts when there are
too many busy threads/long queues, possibly as an indication of a
non-obvious complex outage.
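For illustration only, here is a minimal sketch of what a check on the
*current* state might look like. It is not part of this change. The
function names, the shape of the `health` reading and the threshold
values are all assumptions; the only things taken from the removed
checks are the "queued" and "load" metric names. How the reading is
obtained from the appserver is deliberately left to the caller.

```python
# Hypothetical sketch: evaluate one momentary HHVM health reading and
# map it to a Nagios/NRPE-style status. Thresholds are placeholders.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard plugin exit codes

# Order used to combine results: a real CRITICAL outranks a missing value.
SEVERITY = [OK, UNKNOWN, WARNING, CRITICAL]
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warning, critical):
    """Map a single current reading to a status code."""
    if value is None:
        return UNKNOWN
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def check_hhvm(health, max_threads, queue_warn=50, queue_crit=200):
    """Return (status, message) for one appserver's current reading.

    `health` is assumed to be a dict with optional 'queued' and 'load'
    keys, e.g. {"queued": 3, "load": 12}.
    """
    queued = health.get("queued")
    load = health.get("load")
    queue_status = evaluate(queued, queue_warn, queue_crit)
    # Assumed momentary thresholds: 80% and 95% of the thread pool.
    load_status = evaluate(load, max_threads * 0.8, max_threads * 0.95)
    status = max(queue_status, load_status, key=SEVERITY.index)
    message = "HHVM %s: queued=%s load=%s" % (LABELS[status], queued, load)
    return status, message


if __name__ == "__main__":
    import sys

    # Example reading; a real check would fetch this from the appserver.
    code, text = check_hhvm({"queued": 3, "load": 12}, max_threads=100)
    print(text)
    sys.exit(code)
```

Because the decision depends only on the reading taken at check time,
such a check would recover as soon as the appserver does, rather than
waiting for a window of historic datapoints to age out.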
Change-Id: Icae30008116f6656eed8e7f1d05214d1d11fcb98
---
M modules/mediawiki/manifests/monitoring/webserver.pp
1 file changed, 0 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/98/237998/1

diff --git a/modules/mediawiki/manifests/monitoring/webserver.pp b/modules/mediawiki/manifests/monitoring/webserver.pp
index 21ddfbc..a18cd10 100644
--- a/modules/mediawiki/manifests/monitoring/webserver.pp
+++ b/modules/mediawiki/manifests/monitoring/webserver.pp
@@ -14,25 +14,6 @@
             source  => 'puppet:///modules/mediawiki/monitoring/collectors/hhvm.py',
             require => Apache::Site['hhvm_admin'],
         }
-
-        monitoring::graphite_threshold { 'hhvm_queue_size':
-            description     => 'HHVM queue size',
-            metric          => "servers.${::hostname}.hhvmHealthCollector.queued",
-            warning         => 10,
-            critical        => 80,
-            percentage      => 30,
-            nagios_critical => false
-        }
-
-        monitoring::graphite_threshold { 'hhvm_load':
-            description     => 'HHVM busy threads',
-            metric          => "servers.${::hostname}.hhvmHealthCollector.load",
-            warning         => $::mediawiki::hhvm::max_threads*0.6,
-            critical        => $::mediawiki::hhvm::max_threads * 0.9,
-            percentage      => 30,
-            nagios_critical => false
-        }
-
     }
 
     file { '/var/www/monitoring':

-- 
To view, visit https://gerrit.wikimedia.org/r/237998
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Icae30008116f6656eed8e7f1d05214d1d11fcb98
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Faidon Liambotis <fai...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits