Gehel has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/315651

Change subject: wdqs - move monitoring of response time to service, not individual hosts
......................................................................

wdqs - move monitoring of response time to service, not individual hosts

As varnish is now configured to use the wdqs LVS service, monitoring needs
to be adapted as well.

At this point, only the eqiad service is monitored: the codfw service
does not receive traffic, so its response times are meaningless.

Bug: T148015
Change-Id: Ifc86e4b60e8a67bb03e648271c8ffea0bbdf4551
---
A modules/icinga/manifests/monitor/wdqs.pp
M modules/wdqs/manifests/monitor/blazegraph.pp
2 files changed, 17 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/51/315651/1

diff --git a/modules/icinga/manifests/monitor/wdqs.pp b/modules/icinga/manifests/monitor/wdqs.pp
new file mode 100644
index 0000000..09d8f31
--- /dev/null
+++ b/modules/icinga/manifests/monitor/wdqs.pp
@@ -0,0 +1,17 @@
+# Monitor Wikidata query service
+class icinga::monitor::wdqs {
+
+    # raise a warning / critical alert if response time was over 2 minutes / 5 minutes
+    # more than 5% of the time during the last 10 minutes
+    monitoring::graphite_threshold { 'wdqs-response-time':
+        description   => 'Response time of WDQS',
+        host          => 'wdqs.svc.eqiad.wmnet',
+        metric        => "varnish.eqiad.backends.be_wdqs_svc_eqiad_wmnet.GET.p99",
+        warning       => 120000, # 2 minutes
+        critical      => 300000, # 5 minutes
+        from          => '10min',
+        percentage    => 5,
+        contact_group => 'wdqs-admins',
+    }
+
+}
diff --git a/modules/wdqs/manifests/monitor/blazegraph.pp b/modules/wdqs/manifests/monitor/blazegraph.pp
index 46acbeb..edeaa2b 100644
--- a/modules/wdqs/manifests/monitor/blazegraph.pp
+++ b/modules/wdqs/manifests/monitor/blazegraph.pp
@@ -26,18 +26,5 @@
         source   => 'puppet:///modules/wdqs/monitor/blazegraph.py',
     }
 
-    # raise a warning / critical alert if response time was over 2 minutes / 5 minutes
-    # more than 5% of the time during the last minute
-    $sanitized_hostname = regsubst($::fqdn, '\.', '_', 'G')
-    monitoring::graphite_threshold { 'wdqs-response-time':
-        description   => 'Response time of WDQS',
-        metric        => "varnish.eqiad.backends.be_${sanitized_hostname}.GET.p99",
-        warning       => 120000, # 2 minutes
-        critical      => 300000, # 5 minutes
-        from          => '10min',
-        percentage    => 5,
-        contact_group => 'wdqs-admins',
-    }
-
     # TODO: add monitoring of the http and https endpoints, and of the service
 }
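
The commit message notes that codfw is left unmonitored because it receives
no traffic. As a rough sketch only, a codfw counterpart might look like the
resource below if that ever changes. It is not part of this change. The
resource title, the host and the metric path are assumptions extrapolated
from the eqiad naming and would need to be checked against what Graphite
actually records. The thresholds and other parameters are copied from the
eqiad check above.

```puppet
# Hypothetical sketch, not part of this change: a codfw counterpart of the
# eqiad check, relevant only once the codfw service receives traffic.
# ASSUMPTIONS: the resource title, the host and the metric path are guessed
# from the eqiad naming and must be verified in Graphite before use.
monitoring::graphite_threshold { 'wdqs-response-time-codfw':
    description   => 'Response time of WDQS (codfw)',
    host          => 'wdqs.svc.codfw.wmnet',
    metric        => 'varnish.codfw.backends.be_wdqs_svc_codfw_wmnet.GET.p99',
    warning       => 120000, # 2 minutes
    critical      => 300000, # 5 minutes
    from          => '10min',
    percentage    => 5,
    contact_group => 'wdqs-admins',
}
```

The resource title differs from 'wdqs-response-time' because Puppet requires
resource titles of the same type to be unique within a catalog.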

-- 
To view, visit https://gerrit.wikimedia.org/r/315651
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifc86e4b60e8a67bb03e648271c8ffea0bbdf4551
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Gehel <gleder...@wikimedia.org>

