Rush has submitted this change and it was merged.
Change subject: icinga tools.checker Tools paging update
......................................................................
icinga tools.checker Tools paging update
* Add paging for LDAP being unavailable
* Add paging for puppetmaster catalogue fetch
failures (for the checker host at least)
* Make all criticals consistent
We had an event last night where LDAP was broken.
These checks did fire, but the failure went unnoticed
for 45 minutes. The checks have existed for a while
from a Catchpoint POV; the icinga checks were added in
287723. They seem to have been stable over 3 days
(other than failing when expected due to actual events).
Currently no alerting catches these cases, and yet they
can leave labs unusable.
Note: amended to remove paging for broken puppet; it
seems like overkill for the moment.
Change-Id: I31aa3bea004b71aa21c1ba217e52c3d97ea6a0b0
---
M modules/icinga/manifests/monitor/toollabs.pp
1 file changed, 3 insertions(+), 2 deletions(-)
Approvals:
Andrew Bogott: Looks good to me, but someone else must approve
Rush: Verified; Looks good to me, approved
jenkins-bot: Verified
diff --git a/modules/icinga/manifests/monitor/toollabs.pp b/modules/icinga/manifests/monitor/toollabs.pp
index abee4e1..5208822 100644
--- a/modules/icinga/manifests/monitor/toollabs.pp
+++ b/modules/icinga/manifests/monitor/toollabs.pp
@@ -50,8 +50,8 @@
monitoring::service { 'tools-checker-self':
description => 'toolschecker service itself needs to return OK',
check_command => "${checker}!/self!OK",
- critical => true,
host => $test_entry_host,
+ critical => true,
}
monitoring::service { 'tools-checker-dumps':
@@ -70,6 +70,7 @@
description => 'Test LDAP for query',
check_command => "${checker}!/ldap!OK",
host => $test_entry_host,
+ critical => true,
}
monitoring::service { 'tools-checker-puppetmaster-eqiad':
@@ -87,8 +88,8 @@
monitoring::service { 'tools-checker-nfs-home':
description => 'NFS read/writeable on labs instances',
check_command => "${checker}!/nfs/home!OK",
- critical => true,
host => $test_entry_host,
+ critical => true,
}
# new instances will block on this for spinup if failing
--
To view, visit https://gerrit.wikimedia.org/r/288603
Gerrit-MessageType: merged
Gerrit-Change-Id: I31aa3bea004b71aa21c1ba217e52c3d97ea6a0b0
Gerrit-PatchSet: 3
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Rush <[email protected]>
Gerrit-Reviewer: Andrew Bogott <[email protected]>
Gerrit-Reviewer: Rush <[email protected]>
Gerrit-Reviewer: Yuvipanda <[email protected]>
Gerrit-Reviewer: jenkins-bot <>
_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits