On 08/25/14 19:48, Phil Pennock wrote:
> On 2014-08-25 at 18:02 -0500, Lawrence K. Chen, P.Eng. wrote:
>> I don't know much about pagerduty, except one group on campus that
>> shares our Nagios server is using it.
>>
>> So, the perl script to tie into nagios hasn't left a good impression
>> on me.
>
> On the bright side, it's Perl, and there's always _some_ sucker around
> who can read Perl.  (Although, here, it's ... me.  Damn.)  This means
> that you can understand what is going to be running on your monitoring
> system, which is worth a lot to me.
>
True.... though, compared to that, figuring out why DateTime::TimeZone
after 1.69 didn't work in irssi was harder...
>> And, a proxy is needed because the server is in private IP space
>> (eventually our entire datacenter will be.... though it sounds like
>> it'll all be behind our F5, but it's been WIP for almost a couple of
>> years now.)
>
> That is not a sufficient reason to need a proxy.  You can use NAT to
> solve that problem.  What makes more sense is that you might not _want_
> all server instances to have unconstrained network access: if you can
> lock down which machines can talk where, then you can have more
> confidence about the communication patterns involved and what an
> attacker might or might not be able to do.  Running a SOCKS proxy and
> logging connections made will get you an event stream which can in
> theory be inspected for anomalies.
>
NAT (or Secure NAT) would solve a lot of problems.... except that none of
the options I have access to were available for this vlan.  Partly that's
because it's a vlan for servers that will have full access to all other
vlans in the datacenter, regardless of data classification.

There is also a move toward someday going default-deny outbound on every
host.... which I have dabbled with on some hosts in the past, with varied
success.  What won't stop are the people who fix a problem by sticking in
an allow-all rule (sometimes with a comment that it's temporary.... though
some that I've come across pre-date me.)

For instance, servers with PII (including our SSNs) definitely should not
be accessible from off campus.... but a former co-worker had messed up
the firewalls on such servers just before lunch, and his quick fix had
been to allow any to any.  (And these servers sat in the old network that
is outside any firewalls... when I had started, we were supposed to be
working on moving everything off of that network.... but 8 years later,
we still have many things in it.
Previous managers had said that they might just request a new vlan that
is all or part of that IP space and push all the servers into it.... but
it hasn't happened, and I don't know enough to know why or why not.)

So, sometime after the co-worker had left (for a hosting company that is
all about security; apparently, if they don't think you're serious enough
about security, you can't be a customer of theirs), I discovered this and
immediately fixed the firewalls correctly.  Which broke at least one
essential service that was running in the cloud, but shouldn't have been.
I was then told that while what I did was the correct action, I was in
trouble because it broke that service.... and, in the climate shortly
after Virginia Tech, it got us in a lot of hot water, especially since I
rejected the tickets to reopen the servers to the world.  We did
eventually open it temporarily to a specific IP, and I then pulled it as
scheduled without waiting to see whether they had actually migrated the
service completely back into our datacenter....

Though recently there have been numerous cases where I'll find that the
ticket got reassigned to somebody who overrode the restriction, still
leaving me to fix the mess that comes from it.  (Though I didn't put in
any overtime to eventually fix the visible part of the problem.... forget
where I left the other part.)

> CPAN on FreeBSD should be integrating into Ports, registering as an
> installed port.  I haven't looked at how CPAN/FreeBSD ties into package
> creation, instead of just registering something as having been
> installed.
>
BSDPAN hasn't been updated for pkgng, and I'm not sure it will be:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187111

I had thought about looking under its covers to see what's actually
involved, but I keep getting distracted with other things to fix.  Also,
I'm probably not going to get in any patches to save the ports I'm using
that haven't been staged.
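For what it's worth, the default-deny-outbound idea above is easy enough
to sketch in pf on FreeBSD; the interface name, relay address, and chosen
services below are made up for illustration, not from any real ruleset of
ours:

```
# /etc/pf.conf sketch: default deny outbound, with explicit allows.
# em0 and 10.1.2.25 are hypothetical placeholders.
ext_if = "em0"
smtp_relay = "10.1.2.25"

block out log on $ext_if all                           # default deny
pass out on $ext_if proto { tcp, udp } to any port 53  # DNS lookups
pass out on $ext_if proto tcp to $smtp_relay port 25   # mail via relay only

# The failure mode described above looks like this, and tends to stay:
# pass out quick on $ext_if all   # TEMPORARY, remove after lunch
```

The logging on the block rule is what makes the policy maintainable: you
can watch pflog for legitimate traffic you forgot, instead of someone
reaching for the allow-all hammer.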
> I really strongly recommend taking a look at Poudriere, the new Ports
> building framework being used by the FreeBSD folks.  It uses jails,
> dynamically creating jails for builds, to fully isolate the Ports.
> Create an overlay ports tree to contain the things which you want
> packaged, and you're in good shape.  The only caveat is that you can't
> use bind mounts or symlinks to hold the overlay in place, since a
> read-only bind mount is used to expose the Ports tree into the build
> jails.  This is just good motivation to keep the overlay in git and
> auto-sync in whatever setup tool you use to manage the tree updates.
>
I've been working on and off for the last couple of months trying to get
Poudriere going (along with Portshaker, to merge my bits and other
people's bits into a ports tree).

The main hang-up has been to have the poudriere jails use most of the
same options I've set across the various make.conf's, and something
similar with the various port options.  (That, and a disk failing and the
200+ hours to resilver... it was originally estimating 400+ hours, but
things picked up along the way.  That was followed by replacing the other
disk; looks like I'll need to reboot to see the increased space.  And
ignoring the failing disk in a raidz pool.... not sure how it came to be
that, even though all my hard drives have 512-byte sectors, I apparently
had given enough thought to make my mirrored zpool 4K-aligned, but not my
raidz/raidz2 pools.  It's getting hard to find drives to replace disks in
the raidz pool, though the remaining 'good' disk from the mirrored pool
is the same size as the disks in the raidz pool, so I could use that
now.)

Decided that 4 jails for each of the 4 servers wasn't what I wanted....
settled on 2 jails.  One for my two headless servers (which I had
intended to both be the same.... though there has been some drift, which
I'm currently working on resolving).
And one for my two workstation servers.... though that might not be
possible, since there's quite a lot of difference between the two, since
they do quite different things in the background.  Like, my home system
is my policy server for CFEngine 3, and it's my SVN server (though
there's a project to convert to git and host that from my headless
servers, in some HA failover scheme using the HAST volume.)  Though I
might just do something else to get things into git (so tired of doing
'svn commit' and having more than what I intended get committed.... need
to stop doing so much task switching.... perhaps I need to stop using
-m "<life story>" with my commits, though it was hard enough to break the
bad habit I had picked up at work of just doing 'svn commit -m ""').

Plus there are other differences and desires for the two workstation
servers....  I suppose I'm a bit more of a risk taker with ports on my
system at home (even though it's actually more important to my work than
my work system is, since I pop all my work email to it.... so that I can
better decide what's spam and what's not, and use more than 64k of
message filters to sort it....  And then connect to it from work to keep
up with my work email...)  Though having my work system stay up all the
time is important, for all the screen sessions I have running on there.
Though one obvious difference is that the work system has mysql=5.6p...

> Monitor process count on the monitoring box, alert when it goes too
> high.  You might not reach Pagerduty, but that's the sort of event, on
> an isolated box under your control, where you make sure that the
> contact for that service goes to multiple places, not just pagerduty.
> Start screaming blue bloody murder.
>
In the past, I had gotten comments asking why there have been nagios
alerts that nagios is down....  But, as I mentioned, it's a shared
server, and one of the groups using it is the one using pagerduty.  I
have no control over how they choose to get alerted or not.
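The multi-path alerting Phil describes can be expressed directly in the
Nagios object config; every name below (contacts, group, commands, host)
is a hypothetical placeholder, and the notification commands are assumed
to be defined elsewhere:

```
# One contact per delivery channel; the *_commands are assumed to exist.
define contact {
    contact_name                  oncall-pager
    host_notification_period      24x7
    service_notification_period   24x7
    host_notification_options     d,u,r
    service_notification_options  w,u,c,r
    host_notification_commands    notify-host-by-pagerduty
    service_notification_commands notify-service-by-pagerduty
}

define contactgroup {
    contactgroup_name  nagios-selfcheck
    alias              Every channel, for when Nagios itself is sick
    members            oncall-pager,oncall-email,oncall-sms
}

# Attach the group to the self-check service so one dead integration
# does not silence the alert entirely.  check_nagios arguments here are
# illustrative only.
define service {
    use                 generic-service
    host_name           nagios-host
    service_description nagios-procs
    check_command       check_nagios!5!/var/log/nagios/nagios.log!nagios
    contact_groups      nagios-selfcheck
}
```

The point is that the self-check service's contact_groups fans out to
every channel at once, rather than whichever single path a group happens
to prefer.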
And, nothing to do with the fact that the sysadmin is my former $boss.

> Why did they switch in one move, instead of using _both_ pagerduty
> _and_ email?
>
Probably one of the great mysteries that will never be solved....

>> Hadn't really thought about our notifications from this Nagios server
>> now being dependent on our smtp server.... our old server had been in
>> the datacenter range that is completely open to the world.... so it
>> did its own mail delivery (especially important when it used to
>> largely inform us of problems with campus email...)  Though it's
>> getting hard for me to handle notifications timely/safely....
>
> See above: once you've debugged your integration and checked the
> failure modes, the active link and retries outside the
> store-and-forward result in more reliability.  I still have the emails
> too, but they're mostly just something to purge from my mail folders
> when they finally straggle in.
>
Well, the pagerduty setup is specific to one group.... the rest of us
will still be dependent on email, and, worse, on the unpredictable
email-to-SMS.  At least it has gotten considerably less noisy (something
my new manager praised about our configuration... which came out of a
best-practices session from my first LISA ;)

And I'm probably not concerned enough about it.... though we're supposed
to eventually have a capable NOC to deal with most notifications, or to
escalate appropriately from there.  Wonder when the purge of always-down
or irrelevant monitoring will take place?  Still don't know why we're
monitoring the inter-library loan printer... or who's supposed to act if
it goes down.  (Probably something left over from when LNS merged with
CNS, well before I started here.)  Also, when I started here, our nagios
was running on a long-ago former sysadmin's desktop, along with constant
grumbling that it needed to get upgraded and moved....
(Something that I decided to just tackle on my own, following my first
contract renewal.... and perhaps why I had gone to such sessions at my
first LISA...)

-- 
Who: Lawrence K. Chen, P.Eng. - W0LKC - Sr. Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- & SafeZone Ally

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
