Rush has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/288603

Change subject: icinga tools.checker Tools paging update
......................................................................

icinga tools.checker Tools paging update

* Add paging for LDAP being unavailable
* Add paging for puppetmaster catalogue fetch
  failures (for the checker host at least)
* Make the placement of 'critical' consistent

We had an event last night where LDAP was broken.
These checks did fire, but the failures went unnoticed
for 45 minutes. The checks have existed for a while
from a catchpoint POV; icinga was added in 287723. They
seem to have been stable over 3 days (other than
failing when expected due to actual events).

Currently no alerting catches these cases and yet they
can leave labs unusable.

Change-Id: I31aa3bea004b71aa21c1ba217e52c3d97ea6a0b0
---
M modules/icinga/manifests/monitor/toollabs.pp
1 file changed, 4 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/03/288603/1

diff --git a/modules/icinga/manifests/monitor/toollabs.pp b/modules/icinga/manifests/monitor/toollabs.pp
index abee4e1..ff5bbb3 100644
--- a/modules/icinga/manifests/monitor/toollabs.pp
+++ b/modules/icinga/manifests/monitor/toollabs.pp
@@ -50,8 +50,8 @@
     monitoring::service { 'tools-checker-self':
         description   => 'toolschecker service itself needs to return OK',
         check_command => "${checker}!/self!OK",
-        critical      => true,
         host          => $test_entry_host,
+        critical      => true,
     }
 
     monitoring::service { 'tools-checker-dumps':
@@ -70,12 +70,14 @@
         description   => 'Test LDAP for query',
         check_command => "${checker}!/ldap!OK",
         host          => $test_entry_host,
+        critical      => true,
     }
 
     monitoring::service { 'tools-checker-puppetmaster-eqiad':
         description   => 'Puppet catalogue fetch',
         check_command => "${checker}!/labs-puppetmaster/eqiad!OK",
         host          => $test_entry_host,
+        critical      => true,
     }
 
     monitoring::service { 'tools-checker-labs-dns-private':
@@ -87,8 +89,8 @@
     monitoring::service { 'tools-checker-nfs-home':
         description   => 'NFS read/writeable on labs instances',
         check_command => "${checker}!/nfs/home!OK",
-        critical      => true,
         host          => $test_entry_host,
+        critical      => true,
     }
 
     # new instances will block on this for spinup if failing

-- 
To view, visit https://gerrit.wikimedia.org/r/288603
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I31aa3bea004b71aa21c1ba217e52c3d97ea6a0b0
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Rush <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
