Brodie, Kent wrote:
> Example: in my early days of sysadminning (a VMS server with a
> disk-hungry mumps application) -- a database disk filled up overnight.
> It was... bad. I dealt with it, got things running, and then to
> prevent that from biting me again, I added proactive disk monitoring
> alerts - so that if or when things got out of control, I could be
> automatically paged to investigate and handle the errant process or
> whatever, BEFORE bad things happened. (In today's world, we all do
> that with nagios/hobbit/whatever...)
*touch wood* The majority of things that go wrong for me that aren't a
consequence of my own mistakes are increasingly the more exotic ones
that squeeze outside the normal monitoring parameters: things that are
working, and will continue to work for a while, but aren't actually
right.
A classic example came up a few months back. I'd not long set up Zabbix
to start pulling server metrics I could drill down into and I spotted
one quad-core system had been constantly at 24% CPU usage since I'd
started monitoring it, and had recently bumped up to 49%. An Apache
Tomcat instance had had a single thread spinning away and now appeared
to have two threads spinning. Tomcat was still receiving and processing
new requests, so no alerts were going off, and 24% is hardly a CPU
figure that would trigger an alert (nor was 49%, for that matter). All
the standard CPU alerts in both Nagios and Zabbix were still really
built around the idea of a single processor: if I'd had one core
sitting at 98%, I'd have been alerted. Cue an overhaul of CPU
monitoring to make sure each core was monitored individually, along
with some better "sustained usage" alerts, roughly along the lines of
the sketch below.