Brodie, Kent wrote:
> Example:  in my early days of sysadminning (a VMS server with a
> disk-hungry mumps application) --  a database disk filled up overnight.
> It was...   bad.    I dealt with it, got things running, and then to
> prevent that from biting me again, I added proactive disk monitoring
> alerts - so that if or when things got out of control, I could be
> automatically paged to investigate and handle the errant process or
> whatever, BEFORE bad things happened.   (In today's world, we all do
> that with nagios/hobbit/whatever...)
*touch wood* the majority of things that go wrong for me these days, 
when they aren't a consequence of my own mistakes, are increasingly 
exotic things that manage to squeeze outside the normal monitoring 
parameters: things that are working, and will continue to work for a 
while, but aren't actually right.
A classic example came up a few months back.  I'd not long set up Zabbix 
to start pulling server metrics I could drill down into, and I spotted 
one quad-core system that had been constantly at 24% CPU usage since I'd 
started monitoring it, and had recently bumped up to 49%.  An Apache 
Tomcat instance had had a single thread spinning away, and now appeared 
to have two threads spinning.  On a quad-core box, one fully pegged core 
shows up as roughly 25% overall usage, and two as roughly 50%.  Tomcat 
was still receiving and processing new requests, so no alerts were going 
off, and 24% is hardly a CPU usage that would trigger an alert (nor is 
49%, for that matter).  All the standard CPU alerts in both Nagios and 
Zabbix were still really based around the idea of a single processor: if 
I'd had one core at 98%, I'd have been alerted.  Cue an overhaul of CPU 
monitoring to make sure each core was monitored individually, along with 
some better "sustained usage" alerts.