Gehel has uploaded a new change for review. https://gerrit.wikimedia.org/r/315651
Change subject: wdqs - move monitoring of response time to service, not individual hosts
......................................................................

wdqs - move monitoring of response time to service, not individual hosts

As varnish is now configured to use the wdqs LVS service, monitoring needs
to be adapted as well. At this point, only the eqiad service is monitored
as the codfw service does not receive traffic, and thus response times are
meaningless.

Bug: T148015
Change-Id: Ifc86e4b60e8a67bb03e648271c8ffea0bbdf4551
---
A modules/icinga/manifests/monitor/wdqs.pp
M modules/wdqs/manifests/monitor/blazegraph.pp
2 files changed, 17 insertions(+), 13 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/51/315651/1

diff --git a/modules/icinga/manifests/monitor/wdqs.pp b/modules/icinga/manifests/monitor/wdqs.pp
new file mode 100644
index 0000000..09d8f31
--- /dev/null
+++ b/modules/icinga/manifests/monitor/wdqs.pp
@@ -0,0 +1,17 @@
+# Monitor Wikidata query service
+class icinga::monitor::wdqs {
+
+    # raise a warning / critical alert if response time was over 2 minutes / 5 minutes
+    # more than 5% of the time during the last minute
+    monitoring::graphite_threshold { 'wdqs-response-time':
+        description   => 'Response time of WDQS',
+        host          => 'wdqs.svc.eqiad.wmnet',
+        metric        => "varnish.eqiad.backends.be_wdqs_svc_eqiad_wmnet.GET.p99",
+        warning       => 120000, # 2 minutes
+        critical      => 300000, # 5 minutes
+        from          => '10min',
+        percentage    => 5,
+        contact_group => 'wdqs-admins',
+    }
+
+}
diff --git a/modules/wdqs/manifests/monitor/blazegraph.pp b/modules/wdqs/manifests/monitor/blazegraph.pp
index 46acbeb..edeaa2b 100644
--- a/modules/wdqs/manifests/monitor/blazegraph.pp
+++ b/modules/wdqs/manifests/monitor/blazegraph.pp
@@ -26,18 +26,5 @@
         source => 'puppet:///modules/wdqs/monitor/blazegraph.py',
     }
 
-    # raise a warning / critical alert if response time was over 2 minutes / 5 minutes
-    # more than 5% of the time during the last minute
-    $sanitized_hostname = regsubst($::fqdn, '\.', '_', 'G')
-    monitoring::graphite_threshold { 'wdqs-response-time':
-        description   => 'Response time of WDQS',
-        metric        => "varnish.eqiad.backends.be_${sanitized_hostname}.GET.p99",
-        warning       => 120000, # 2 minutes
-        critical      => 300000, # 5 minutes
-        from          => '10min',
-        percentage    => 5,
-        contact_group => 'wdqs-admins',
-    }
-
     # TODO: add monitoring of the http and https endpoints, and of the service
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/315651
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifc86e4b60e8a67bb03e648271c8ffea0bbdf4551
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Gehel <gleder...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
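
Note on the threshold semantics: the comment in the manifest describes an alert that fires when the p99 response time is over the warning or critical value for more than 5% of the window. A minimal Python sketch of that rule follows, for illustration only. It is not the code behind monitoring::graphite_threshold; the function names percent_over and evaluate are hypothetical, and it assumes the metric is in milliseconds (as the "2 minutes" / "5 minutes" comments suggest) and that missing datapoints are ignored.

```python
from typing import Optional, Sequence

# Values taken from the manifest in this change.
WARNING_MS = 120_000   # 2 minutes
CRITICAL_MS = 300_000  # 5 minutes
PERCENTAGE = 5         # percent of datapoints allowed over a threshold


def percent_over(datapoints: Sequence[Optional[float]], threshold: float) -> float:
    """Return the percentage of non-null datapoints strictly above threshold."""
    # Assumption: null datapoints (gaps in the series) are skipped.
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def evaluate(datapoints: Sequence[Optional[float]]) -> str:
    """Classify a window of p99 response times as OK, WARNING or CRITICAL."""
    # Check the critical threshold first so it takes precedence over warning.
    if percent_over(datapoints, CRITICAL_MS) > PERCENTAGE:
        return "CRITICAL"
    if percent_over(datapoints, WARNING_MS) > PERCENTAGE:
        return "WARNING"
    return "OK"
```

Under these assumptions, a window of ten datapoints with a single value above 120000 (10% of the window) would be classified as WARNING, while one value out of twenty (exactly 5%) would stay OK because the rule requires more than 5%.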