Patrick M. wrote:
> Hi all,
>
> I've been running Nagios 2.6 for about 6 months now, and every now and
> then we get critical pages about a machine being down, or at least
> Nagios can't connect to it. It causes the CEO to freak out and believe
> something is up with our network.
>
> To me, it seems like the box is getting stressed out during the tests
> and is causing the plugins to time out.
>
> Here's some of the alerts from this morning:
>
> #######################################
> [08-30-2007 09:24:10] HOST ALERT: tu.xyz.com;DOWN;SOFT;1;CRITICAL -
> Plugin timed out after 10 seconds
> Service Warning[08-30-2007 09:23:40] SERVICE ALERT:
> pule.xyz.com;PING;WARNING;SOFT;1;PING WARNING - Packet loss = 44%, RTA =
> 3.64 ms
> #######################################

Are you noticing any slowdown in normal network traffic while all this
is happening? Most of the checks that have timed out are ICMP-based.
Assuming you're doing some wonky QoS stuff (Windows has that built in),
it's not too hard to guess that ICMP is probably right at the bottom of
the priority list.

> The machine is a p4 2.4 ghz with 1gb ram.

How many checks are you running per minute? It should be capable of
handling 500 - 800 per minute without any problems at all.

> I'm not sure how to troubleshoot this - any ideas?

Check the QoS settings in the network. If it's not that, try removing
half the checks and see if that solves it. If it does, you've got either
a really bad network or undersized hardware.

If checks other than the ICMP-based ones are acting up too, and you
primarily see lots of false alarms within a short (10-30 second) window,
make sure you haven't got your network card set to auto-negotiate
transfer speed and duplex. I assume you haven't set the Nagios server to
obtain its address via DHCP, as renewing a lease can sometimes have a
funny impact on monitoring, but while you're at it, make sure (by
triple-checking) that there's only one machine with the IP of the
monitoring machine.

> What can I provide
> you folks in order to help me out?

Money, or evidence of having tried things on your own. Both are hard
currency when asking for help in a tech-savvy forum.

-- 
Andreas Ericsson [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
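
One way to gather the "evidence" mentioned above is to check whether the false alarms really do cluster in a short window across several hosts, which would point at the monitoring box or its network link rather than at the monitored machines. The following is only a rough sketch: it assumes alert lines in the date format quoted in the original post (the raw Nagios log may use a different timestamp format), and the names `parse_alerts` and `find_bursts`, the 30-second window, and the two-host threshold are all illustrative choices, not anything Nagios provides.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, taken from the alerts quoted in the original post:
# [08-30-2007 09:24:10] HOST ALERT: tu.xyz.com;DOWN;SOFT;1;CRITICAL - ...
ALERT_RE = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] (HOST|SERVICE) ALERT: ([^;]+);"
)


def parse_alerts(lines):
    """Return (timestamp, kind, host) tuples for every alert line, sorted by time."""
    alerts = []
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%m-%d-%Y %H:%M:%S")
            alerts.append((stamp, match.group(2), match.group(3)))
    return sorted(alerts)


def find_bursts(alerts, window_seconds=30, min_hosts=2):
    """Group alerts that fall within window_seconds of the first alert in the
    group, and keep only the groups that involve at least min_hosts hosts."""
    window = timedelta(seconds=window_seconds)
    bursts, current = [], []
    for alert in alerts:
        # Start a new group once this alert is too far from the group's first one.
        if current and alert[0] - current[0][0] > window:
            if len({a[2] for a in current}) >= min_hosts:
                bursts.append(current)
            current = []
        current.append(alert)
    if len({a[2] for a in current}) >= min_hosts:
        bursts.append(current)
    return bursts


# The two alerts from the original post, with the wrapped lines joined.
SAMPLE = [
    "[08-30-2007 09:24:10] HOST ALERT: tu.xyz.com;DOWN;SOFT;1;"
    "CRITICAL - Plugin timed out after 10 seconds",
    "Service Warning[08-30-2007 09:23:40] SERVICE ALERT: pule.xyz.com;PING;"
    "WARNING;SOFT;1;PING WARNING - Packet loss = 44%, RTA = 3.64 ms",
]

if __name__ == "__main__":
    for burst in find_bursts(parse_alerts(SAMPLE)):
        hosts = sorted({host for _, _, host in burst})
        print(burst[0][0], "->", burst[-1][0], "hosts:", ", ".join(hosts))
```

If most alerts land in multi-host bursts like this, the duplex, DHCP and duplicate-IP suspects are worth checking first; if they are spread out and tied to single hosts, the problem is more likely on the monitored machines themselves.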
_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null