Greetings. 

I just made a nagios change so that the very first alert for a problem
goes _only_ to irc. 

If you are active and looking at a problem at this point, please go and
ack it on the web interface. This will stop escalations. 

Nagios will then wait 10 minutes, and the next alert (if the problem
hasn't recovered or been acked) will go to irc, email, and pagers.

After that it will re-alert every hour to irc, email, and pagers until
the problem is acked or solved. 
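
To make the timing concrete, here is a rough sketch of the schedule
described above. It is not the actual nagios config; the function and
constant names are made up for illustration, and it assumes the hourly
repeats are counted from the 10 minute alert.

```python
# Illustrative model only: names and structure are assumptions, the real
# behaviour lives in the nagios notification/escalation settings.
FIRST_CHANNELS = ("irc",)
ESCALATED_CHANNELS = ("irc", "email", "pager")
ESCALATION_DELAY_MIN = 10   # wait before the first escalated alert
REPEAT_INTERVAL_MIN = 60    # repeat interval once escalated


def notification_schedule(stop_after_min=None, horizon_min=180):
    """Return (minute, channels) pairs for one problem.

    stop_after_min is the minute the problem was acked or recovered
    (None means never); nothing is sent at or after that minute.
    horizon_min just bounds how far ahead the sketch looks.
    """
    schedule = []
    minute = 0
    channels = FIRST_CHANNELS
    while minute <= horizon_min:
        if stop_after_min is not None and minute >= stop_after_min:
            break
        schedule.append((minute, channels))
        # The first gap is 10 minutes; every later gap is an hour.
        minute += ESCALATION_DELAY_MIN if minute == 0 else REPEAT_INTERVAL_MIN
        channels = ESCALATED_CHANNELS
    return schedule
```

So a problem acked within the first 10 minutes only ever shows up on
irc, while one left alone pages at 10, 70, 130 minutes and so on.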

Rationale: 

* Much of the time now we have someone on irc who can look at and fix
  issues (since we have sysadmin main folks in Europe). Paging everyone
  is causing pager fatigue, especially when someone else is already
  fixing it. 

* We get a lot of alerts for short, network-caused blips that recover
  in a few minutes. There's usually nothing we can do about them, our
  users never notice them, and it's causing pager fatigue to page on
  them and then immediately page OK after bothering people. Ideally we
  would adjust these checks, and we should, but it's going to take a
  while to get them all right. 

* We often get a lot of alerts from one proxy or the like being rebooted
  or restarting apache. These usually only last a minute or two
  and there's no need to page on them. 

* We sometimes get alerts directly related to changes we are currently
  making in something and then go fix them. There's no need to page
  someone for this; just keep an eye on irc when making playbook or host
  changes and clean up anything you cause to alert. 

I'd like to get back to the idea that if you get a page it's an
important thing you need to go look at, not "oh, nagios again".

This is all subject to adjustment, but hopefully it will make life a
bit easier for us sysadmin types and not cause any problems for anyone
else. ;) 

kevin

_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure