On 2014-08-25 at 18:02 -0500, Lawrence K. Chen, P.Eng. wrote:
> I don't know much about pagerduty, except one group on campus that shares our
> Nagios server is using it.
>
> So, there's perl script to tie into nagios hasn't left a good impression on
> me.
On the bright side, it's Perl, and there's always _some_ sucker around who can read Perl. (Although, here, it's ... me. Damn.) This means that you can understand what is going to be running on your monitoring system, which is worth a lot to me. (Just last week, I integrated pagerduty into zabbix and it was a breeze.)

> In trying to see what it was doing...found its trying to post to some https
> URL through LWP. Except it still seems that after more than 10+ years...LWP
> https through a proxy is still busted, so don't know why this script would
> expect to work....

The first answer on this question strongly suggests that HTTPS through a proxy with LWP has been solved for some time now:

<http://stackoverflow.com/questions/12116244/https-proxy-and-lwpuseragent>

That said, a script which can't handle timeouts without bombing is problematic. Myself, I suggest rewriting in Go. :)

> And, a proxy is needed because the server is in private IP space (eventually
> our entire datacenter will be....though sounds like it'll all be behind our
> F5, but its been WIP for almost a couple of years now.)

That is not, by itself, a sufficient reason to need a proxy: you can use NAT to solve that problem. What makes more sense is that you might not _want_ all server instances to have unconstrained network access: if you can lock down which machines can talk where, then you can have more confidence about the communication patterns involved and what an attacker might or might not be able to do. Running a SOCKS proxy and logging the connections made will get you an event stream which can, in theory, be inspected for anomalies.

> After a couple of days, I opted for an earlier suggestion I had found online.
> I used a Perl module of LWP-Proxy-Connect (still waiting to see if it'll get
> accepted into FreeBSD Ports) to make the script work. Just a one line change
> to the pagerduty script, IIRC, and it started working again....

CPAN on FreeBSD should integrate into Ports, registering each module as an installed port.
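(An aside on the proxy fix itself: what a CONNECT-capable LWP module does is issue an HTTP CONNECT to the proxy and then negotiate TLS through the resulting tunnel. A minimal sketch of the same idea using Python's stdlib, in case anyone wants to sanity-check a proxy outside of Perl -- the proxy host, port, and target here are hypothetical placeholders, not anything from this thread.)

```python
# Sketch: HTTPS through a forward proxy via HTTP CONNECT.
# http.client supports this natively with set_tunnel().
# proxy.example.edu:3128 and the target host are placeholders.
import http.client

def tunneled_connection(proxy_host, proxy_port, target_host, target_port=443):
    """Return an HTTPSConnection that will CONNECT through the given proxy."""
    conn = http.client.HTTPSConnection(proxy_host, proxy_port, timeout=10)
    # TCP goes to the proxy; TLS is negotiated end-to-end with the target
    # once the proxy acknowledges the CONNECT.
    conn.set_tunnel(target_host, target_port)
    return conn

conn = tunneled_connection("proxy.example.edu", 3128, "events.pagerduty.com")
# conn.request("POST", "/...", body=...) would now go through the tunnel;
# not executed here, since the hosts are placeholders.
```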
I haven't looked at how CPAN on FreeBSD ties into package creation, as opposed to just registering something as having been installed. I really strongly recommend taking a look at Poudriere, the new Ports building framework being used by the FreeBSD folks. It dynamically creates jails for builds, fully isolating each port build. Create an overlay ports tree to contain the things which you want packaged, and you're in good shape. The only caveat is that you can't use bind mounts or symlinks to hold the overlay in place, since a read-only bind mount is used to expose the Ports tree into the build jails. This is just good motivation to keep the overlay in git and auto-sync it in whatever setup tool you use to manage tree updates.

> While I was working on it, the group using it finally logged a couple of
> tickets...one about unable to fork errors, and that they had stopped getting
> notifications, where they thought there should've been some on the weekend.
> (they were the ones killing my Nagios server.)

I apologise in advance for the 20/20 hindsight here, but to be sure that you're looking at it: monitor the process count on the monitoring box, and alert when it goes too high. You might not reach Pagerduty, but that's the sort of event, on an isolated box under your control, where you make sure that the contact for that service goes to multiple places, not just pagerduty. Start screaming blue bloody murder.

> Finally they admitted that they had changed it that Friday from using email to
> posting to web for notifications. (had I known, I might have just suggested
> they switch it back :)

Why did they switch in one move, instead of using _both_ pagerduty _and_ email? We still have some Nagios setups, and that's what they do. Pagerduty is more reliable, because email is store-and-forward, and our email service provider has a tendency to start temp-rejecting, or just queueing, the emails from Nagios when there's an incident which generates a flood.
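(To make the process-count hindsight concrete: a Nagios-plugin-style check for it is only a few lines. This is an illustrative sketch, not a plugin from this thread; the /proc scan is Linux-only and the thresholds are invented, so tune them to your box's baseline.)

```python
# Sketch of a Nagios-plugin-style process-count check.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
# Thresholds are invented examples; the /proc scan is Linux-only.
import os

def count_processes():
    """Count processes by listing the numeric entries in /proc."""
    return sum(1 for name in os.listdir("/proc") if name.isdigit())

def evaluate(count, warn=400, crit=600):
    """Map a process count onto a Nagios (exit_code, message) pair."""
    if count >= crit:
        return 2, "PROCS CRITICAL - %d processes (>= %d)" % (count, crit)
    if count >= warn:
        return 1, "PROCS WARNING - %d processes (>= %d)" % (count, warn)
    return 0, "PROCS OK - %d processes" % count

# As a real plugin you would finish with:
#   code, message = evaluate(count_processes())
#   print(message)
#   sys.exit(code)
```

Wire it in as a local check on the monitoring host itself, with contacts that fan out to more than one place, per the above.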
Pagerduty had a problem a few years back where they were in a single AWS failure zone. They learnt from the mistake, and it's been impressive to see just how much they appear to have taken it as a call to action, improving their processes and failure management. They are pretty damned good these days and I have no qualms about using them.

PD has decent escalation management, decent webhooks support for integrating into various chat systems, and a nice way of handling oncall schedules, with layers of overrides, so you can see the net effect. There are good calendar exports, so you can create both per-rotation calendars and personal oncall calendars -- the latter is good for adding to personal calendars shared with significant others, to allow for household coordination. My wife is quite fond of now automatically seeing when I'm going to be oncall.

Their Android app is usable and doesn't drain battery, while responding quickly enough. I distrust cloud-to-device messaging enough that in my personal "Notification Rules" I add an SMS one minute after the push notification. The option to escalate to a robo voice call is something I've only seen in one other place.

My only real complaint with PagerDuty is that their UX fails horribly when you have people in different time-zones. I ended up just setting my account to Pacific time.

My one annoyance (below a complaint, but still annoying) is that you can't associate a webhook posting as an action on an escalation schedule, so if you have multiple services sharing just a couple of escalation schedules, you still *have* to set the webhook on each service. At some point, I'm going to have to automate this, using their API.
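For what it's worth, the automation I have in mind looks roughly like this: list the services, find the ones on the target escalation policies that are missing the webhook, and POST it to each. The API base URL, endpoint path, auth header, and payload shape below are assumptions from memory, not verified against PagerDuty's current REST API docs, so treat this as a sketch:

```python
# Sketch: attach one webhook to every service using a given escalation
# policy, since webhooks hang off services rather than schedules.
# API base URL, endpoint, auth header, and payload shape are ASSUMPTIONS --
# check PagerDuty's REST API documentation before relying on any of them.
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"  # assumed

def services_needing_webhook(services, policy_ids, webhook_url):
    """Pick services on the target policies that don't yet have the hook."""
    wanted = []
    for svc in services:
        if svc.get("escalation_policy", {}).get("id") not in policy_ids:
            continue
        existing = [w.get("endpoint_url") for w in svc.get("webhooks", [])]
        if webhook_url not in existing:
            wanted.append(svc)
    return wanted

def build_webhook_request(api_token, service_id, webhook_url):
    """Build (but don't send) the POST for one service; shape is assumed."""
    payload = {"webhook": {
        "endpoint_url": webhook_url,
        "service": {"id": service_id, "type": "service_reference"},
    }}
    return urllib.request.Request(
        API_BASE + "/webhooks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Token token=" + api_token,
                 "Content-Type": "application/json"},
        method="POST",
    )

# Driving it: fetch the service list from the API, then for each service
# returned by services_needing_webhook(), urlopen(build_webhook_request(...)).
```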
> Hadn't really thought about our notifications from this Nagios server now
> being dependent on our smtp server....our old server had been in the
> datacenter range that is completely open to the world....so it did its own
> mail delivery (especially important when it used to largely inform us of
> problems with campus email...) Though its getting hard for me to handle
> notifications timely/safely....

See above: once you've debugged your integration and checked the failure modes, the active link, with its retries, ends up more reliable than store-and-forward. I still have the emails too, but they're mostly just something to purge from my mail folders when they finally straggle in.

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
