On 2014-08-25 at 18:02 -0500, Lawrence K. Chen, P.Eng. wrote:
> I don't know much about pagerduty, except one group on campus that shares our
> Nagios server is using it.
>
> So, there's perl script to tie into nagios hasn't left a good impression on
> me.
On the bright side, it's Perl, and there's always _some_ sucker around who can read Perl. (Although, here, it's ... me. Damn.) This means that you can understand what is going to be running on your monitoring system, which is worth a lot to me. (Just last week, I integrated pagerduty into zabbix and it was a breeze.)

> In trying to see what it was doing...found its trying to post to some https
> URL through LWP. Except it still seems that after more than 10+ years...LWP
> https through a proxy is still busted, so don't know why this script would
> expect to work....

The first answer on this question strongly suggests that HTTPS through a proxy with LWP has been solved for some time now:

<http://stackoverflow.com/questions/12116244/https-proxy-and-lwpuseragent>

That said, a script which can't handle timeouts without bombing is problematic. Myself, I suggest rewriting in Go. :)

> And, a proxy is needed because the server is in private IP space (eventually
> our entire datacenter will be....though sounds like it'll all be behind our
> F5, but its been WIP for almost a couple of years now.)

That is not, by itself, a sufficient reason to need a proxy: you can use NAT to solve that problem. What makes more sense is that you might not _want_ all server instances to have unconstrained network access: if you can lock down which machines can talk where, then you can have more confidence about the communication patterns involved and what an attacker might or might not be able to do. Running a SOCKS proxy and logging the connections made will get you an event stream which can, in theory, be inspected for anomalies.

> After a couple of days, I opted for an earlier suggestion I had found online.
> I used a Perl module of LWP-Proxy-Connect (still waiting to see if it'll get
> accepted into FreeBSD Ports) to make the script work. Just a one line change
> to the pagerduty script, IIRC, and it started working again....

CPAN on FreeBSD should integrate into Ports, registering each module as an installed port.
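(An aside on the proxy fix itself: what a CONNECT-capable LWP module does is issue an HTTP CONNECT to the proxy and then negotiate TLS through the resulting tunnel. A minimal sketch of the same idea using Python's stdlib, in case anyone wants to sanity-check a proxy outside of Perl -- the proxy host, port, and target here are hypothetical placeholders, not anything from this thread.)

```python
# Sketch: HTTPS through a forward proxy via HTTP CONNECT.
# http.client supports this natively with set_tunnel().
# proxy.example.edu:3128 and the target host are placeholders.
import http.client

def tunneled_connection(proxy_host, proxy_port, target_host, target_port=443):
    """Return an HTTPSConnection that will CONNECT through the given proxy."""
    conn = http.client.HTTPSConnection(proxy_host, proxy_port, timeout=10)
    # TCP goes to the proxy; TLS is negotiated end-to-end with the target
    # once the proxy acknowledges the CONNECT.
    conn.set_tunnel(target_host, target_port)
    return conn

conn = tunneled_connection("proxy.example.edu", 3128, "events.pagerduty.com")
# conn.request("POST", "/...", body=...) would now go through the tunnel;
# not executed here, since the hosts are placeholders.
```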
I haven't looked at how CPAN on FreeBSD ties into package creation, as opposed to just registering something as having been installed. I really strongly recommend taking a look at Poudriere, the new Ports building framework being used by the FreeBSD folks. It dynamically creates jails for builds, fully isolating each port build. Create an overlay ports tree to contain the things which you want packaged, and you're in good shape. The only caveat is that you can't use bind mounts or symlinks to hold the overlay in place, since a read-only bind mount is used to expose the Ports tree into the build jails. This is just good motivation to keep the overlay in git and auto-sync it in whatever setup tool you use to manage tree updates.

> While I was working on it, the group using it finally logged a couple of
> tickets...one about unable to fork errors, and that they had stopped getting
> notifications, where they thought there should've been some on the weekend.
> (they were the ones killing my Nagios server.)

I apologise in advance for the 20/20 hindsight here, but to be sure that you're looking at it: monitor the process count on the monitoring box, and alert when it goes too high. You might not reach Pagerduty, but that's the sort of event, on an isolated box under your control, where you make sure that the contact for that service goes to multiple places, not just pagerduty. Start screaming blue bloody murder.

> Finally they admitted that they had changed it that Friday from using email to
> posting to web for notifications. (had I known, I might have just suggested
> they switch it back :)

Why did they switch in one move, instead of using _both_ pagerduty _and_ email? We still have some Nagios setups, and that's what they do. Pagerduty is more reliable, because email is store-and-forward, and our email service provider has a tendency to start temp-rejecting, or just queueing, the emails from Nagios when there's an incident which generates a flood.
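(To make the process-count hindsight concrete: a Nagios-plugin-style check for it is only a few lines. This is an illustrative sketch, not a plugin from this thread; the /proc scan is Linux-only and the thresholds are invented, so tune them to your box's baseline.)

```python
# Sketch of a Nagios-plugin-style process-count check.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
# Thresholds are invented examples; the /proc scan is Linux-only.
import os

def count_processes():
    """Count processes by listing the numeric entries in /proc."""
    return sum(1 for name in os.listdir("/proc") if name.isdigit())

def evaluate(count, warn=400, crit=600):
    """Map a process count onto a Nagios (exit_code, message) pair."""
    if count >= crit:
        return 2, "PROCS CRITICAL - %d processes (>= %d)" % (count, crit)
    if count >= warn:
        return 1, "PROCS WARNING - %d processes (>= %d)" % (count, warn)
    return 0, "PROCS OK - %d processes" % count

# As a real plugin you would finish with:
#   code, message = evaluate(count_processes())
#   print(message)
#   sys.exit(code)
```

Wire it in as a local check on the monitoring host itself, with contacts that fan out to more than one place, per the above.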
Pagerduty had a problem a few years back where they were in a single AWS failure zone. They learnt from the mistake, and it's been impressive to see just how much they appear to have taken it as a call to action, improving their processes and failure management. They are pretty damned good these days and I have no qualms about using them.

PD has decent escalation management, decent webhooks support for integrating into various chat systems, and a nice way of handling oncall schedules, with layers of overrides, so you can see the net effect. There are good calendar exports, so you can create both per-rotation calendars and personal oncall calendars -- the latter is good for adding to personal calendars shared with significant others, to allow for household coordination. My wife is quite fond of now automatically seeing when I'm going to be oncall.

Their Android app is usable and doesn't drain battery, while responding quickly enough. I distrust cloud-to-device messaging enough that in my personal "Notification Rules" I add an SMS one minute after the push notification. The option to escalate to a robo voice call is something I've only seen in one other place.

My only real complaint with PagerDuty is that their UX fails horribly when you have people in different time-zones. I ended up just setting my account to Pacific time.

My one annoyance (below a complaint, but still annoying) is that you can't associate a webhook posting as an action on an escalation schedule, so if you have multiple services sharing just a couple of escalation schedules, you still *have* to set the webhook on each service. At some point, I'm going to have to automate this, using their API.
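For what it's worth, the automation I have in mind looks roughly like this: list the services, find the ones on the target escalation policies that are missing the webhook, and POST it to each. The API base URL, endpoint path, auth header, and payload shape below are assumptions from memory, not verified against PagerDuty's current REST API docs, so treat this as a sketch:

```python
# Sketch: attach one webhook to every service using a given escalation
# policy, since webhooks hang off services rather than schedules.
# API base URL, endpoint, auth header, and payload shape are ASSUMPTIONS --
# check PagerDuty's REST API documentation before relying on any of them.
import json
import urllib.request

API_BASE = "https://api.pagerduty.com"  # assumed

def services_needing_webhook(services, policy_ids, webhook_url):
    """Pick services on the target policies that don't yet have the hook."""
    wanted = []
    for svc in services:
        if svc.get("escalation_policy", {}).get("id") not in policy_ids:
            continue
        existing = [w.get("endpoint_url") for w in svc.get("webhooks", [])]
        if webhook_url not in existing:
            wanted.append(svc)
    return wanted

def build_webhook_request(api_token, service_id, webhook_url):
    """Build (but don't send) the POST for one service; shape is assumed."""
    payload = {"webhook": {
        "endpoint_url": webhook_url,
        "service": {"id": service_id, "type": "service_reference"},
    }}
    return urllib.request.Request(
        API_BASE + "/webhooks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": "Token token=" + api_token,
                 "Content-Type": "application/json"},
        method="POST",
    )

# Driving it: fetch the service list from the API, then for each service
# returned by services_needing_webhook(), urlopen(build_webhook_request(...)).
```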
> Hadn't really thought about our notifications from this Nagios server now
> being dependent on our smtp server....our old server had been in the
> datacenter range that is completely open to the world....so it did its own
> mail delivery (especially important when it used to largely inform us of
> problems with campus email...) Though its getting hard for me to handle
> notifications timely/safely....

See above: once you've debugged your integration and checked the failure modes, the active link, with its retries, ends up more reliable than store-and-forward. I still have the emails too, but they're mostly just something to purge from my mail folders when they finally straggle in.

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
