Re: [Nagios-users] Host retry interval
>>> for that service. Ergo, if you set your service checks max to 15,
>>> after 15 minutes (assuming your delay is 60 seconds) your service
>>> will hit a HARD CRITICAL, and host checks will fire.
>>
>> That's not correct, host checks are performed as soon as a service
>> check returns some non-OK status.
>
> Yes, that's right. A service never reaches the max_check_attempts if
> the host check returns not OK.

Thanks for all the insight. And I assume ALL service checks are on hold as soon as a host check returns non-OK.

What I think I'm finding is that some of my SNMP-based service check scripts are hanging far too long, and the Net-SNMP daemon can only handle one connection at a time (it doesn't fork a new daemon), so all subsequent checks fail. That includes my HOST check, which is a combination of connecting to the snmpd and sshd daemons. The sshd connection is fine but the snmpd connection times out. I do it this way because pinging the systems I'm monitoring isn't possible, as they're spread across the Internet.

Should I set my service check timeout to something outrageous (it's already at 90 seconds), or can I tell Nagios to run serialized service checks for certain problematic hosts?

--
- Kyle - [EMAIL PROTECTED] http://www.panix.com/~kylet -

---
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnkkid=120709bid=263057dat=121642
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
Re: [Nagios-users] Host retry interval
I'd suggest NOT taking this action at the host level, the reason being that all service checks are halted for the duration of this non-parallelized action. You want to avoid doing a host check until absolutely necessary.

I suggest increasing the max_check_attempts on the SERVICE to a larger number; this will avoid impacting the monitoring system as a whole.

/eli

On 5/22/06 11:45 AM, Kyle Tucker [EMAIL PROTECTED] wrote:

> Hi,
>
> I have many hosts that are constantly giving me DOWN/UP state as they
> are unreachable for certain periods. In an attempt to give the system
> more time to become available, I increased the max_check_attempts from
> 2 to 5. At 2 the interval between retry attempts was 10 seconds. Now
> at 5, the interval is 7 seconds. I'd like to have this interval higher
> for some hosts, but there's a real scary note on the hosts
> check_interval option to not use it if you can help it. Is this my
> only option or is there a better way? I am also following the thread
> titled Workaround for 'Host DOWN' false-positives but I don't clearly
> see how to set that up.
>
> Here's some output for a failed host (I tail and parse the date field
> in a script so it's readable).
>
> Mon May 22 14:25:52 2006 HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
> Mon May 22 14:25:59 2006 HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
> Mon May 22 14:26:06 2006 HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
> Mon May 22 14:26:16 2006 HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
> Mon May 22 14:26:23 2006 HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding
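As a quick cross-check of the retry spacing, the HOST ALERT lines quoted above can be parsed and the gaps between attempts computed. This is only a minimal sketch: it assumes the exact line layout shown in the quoted output (a timestamp that has already been rewritten into readable form, followed by semicolon-separated fields), and the helper name `parse_alert` is made up for this example.

```python
from datetime import datetime

# The five HOST ALERT lines from the quoted message, copied verbatim.
LOG = """\
Mon May 22 14:25:52 2006 HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
Mon May 22 14:25:59 2006 HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
Mon May 22 14:26:06 2006 HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
Mon May 22 14:26:16 2006 HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
Mon May 22 14:26:23 2006 HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding
"""


def parse_alert(line):
    """Hypothetical helper: return (timestamp, state_type, attempt) for one line."""
    stamp, _, rest = line.partition(" HOST ALERT: ")
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    _host, _state, state_type, attempt, _output = rest.split(";", 4)
    return when, state_type, int(attempt)


alerts = [parse_alert(line) for line in LOG.splitlines() if line.strip()]

# Seconds between consecutive check attempts.
gaps = [int((b[0] - a[0]).total_seconds()) for a, b in zip(alerts, alerts[1:])]

# Seconds from the first SOFT alert to the final HARD alert.
total = int((alerts[-1][0] - alerts[0][0]).total_seconds())

print("gaps between attempts:", gaps)
print("seconds from first alert to HARD state:", total)
```

For this excerpt the gaps are 7, 7, 10 and 7 seconds, so the host goes from its first SOFT alert to a HARD DOWN in 31 seconds, which is why raising max_check_attempts from 2 to 5 bought so little extra time.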
Re: [Nagios-users] Host retry interval
Thanks Eli,

According to the docs, Nagios checks the status of a host when a service check results in a non-OK status. Is that when it reaches a HARD state after all iterations of max_check_attempts are done, or as soon as it goes to a non-OK (SOFT) state? If the latter, which seems to be when it's kicking off host checks, I don't see how increasing the service check attempts will help.

> I'd suggest NOT taking this action at the host level; reason being
> that all service checks are halted for the duration of this
> non-parallelized action... You want to avoid doing a host check until
> absolutely necessary.
>
> Suggest increasing the max_check_attempts on the SERVICE to a larger
> number, this will avoid impacting the monitoring system as a whole.
>
> /eli
>
> On 5/22/06 11:45 AM, Kyle Tucker [EMAIL PROTECTED] wrote:
>
>> Hi,
>>
>> I have many hosts that are constantly giving me DOWN/UP state as they
>> are unreachable for certain periods. In an attempt to give the system
>> more time to become available, I increased the max_check_attempts
>> from 2 to 5. At 2 the interval between retry attempts was 10 seconds.
>> Now at 5, the interval is 7 seconds. I'd like to have this interval
>> higher for some hosts, but there's a real scary note on the hosts
>> check_interval option to not use it if you can help it. Is this my
>> only option or is there a better way? I am also following the thread
>> titled Workaround for 'Host DOWN' false-positives but I don't clearly
>> see how to set that up.
>>
>> Here's some output for a failed host (I tail and parse the date field
>> in a script so it's readable).
>>
>> Mon May 22 14:25:52 2006 HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
>> Mon May 22 14:25:59 2006 HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
>> Mon May 22 14:26:06 2006 HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
>> Mon May 22 14:26:16 2006 HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
>> Mon May 22 14:26:23 2006 HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding

--
- Kyle - [EMAIL PROTECTED] http://www.panix.com/~kylet -
Re: [Nagios-users] Host retry interval
* Kyle Tucker [EMAIL PROTECTED] [2006-05-22 14:45]:
> I have many hosts that are constantly giving me DOWN/UP state as they
> are unreachable for certain periods. In an attempt to give the system
> more time to become available, I increased the max_check_attempts from
> 2 to 5. At 2 the interval between retry attempts was 10 seconds. Now
> at 5, the interval is 7 seconds. I'd like to have this interval higher
> for some hosts, but there's a real scary note on the hosts
> check_interval option to not use it if you can help it.

Using `check_interval' within a host definition would activate regularly scheduled checks of the host, as opposed to checking the host on-demand only. The directive you'd want is `retry_check_interval', but such a thing doesn't exist for host checks.

> Is this my only option or is there a better way?

With Nagios 2.x, there is no real solution; you can only use workarounds such as defining host escalations in order to suppress the first notification. AFAICS, this will be solved with Nagios 3.x.

> I am also following the thread titled Workaround for 'Host DOWN'
> false-positives

This might be another useful workaround, haven't looked at it yet.

Holger

--
PGP fingerprint: F1F0 9071 8084 A426 DD59 9839 59D3 F3A1 B8B5 D3DE
Re: [Nagios-users] Host retry interval
On Monday, 22 May 2006 22:44, Holger Weiss wrote:
> * Eli Stair [EMAIL PROTECTED] [2006-05-22 12:49]:
>> for that service. Ergo, if you set your service checks max to 15,
>> after 15 minutes (assuming your delay is 60 seconds) your service
>> will hit a HARD CRITICAL, and host checks will fire.
>
> That's not correct, host checks are performed as soon as a service
> check returns some non-OK status.

Yes, that's right. A service never reaches the max_check_attempts if the host check returns not OK.

Jörg
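To make the two statements above concrete, here is a toy model in Python. It is not Nagios code and does not reproduce the Nagios scheduler; it only encodes what is claimed in this thread: every non-OK service result triggers an on-demand host check, and the service stops retrying as soon as that host check comes back DOWN. The function name `simulate` and the event tuples are invented for the illustration.

```python
def simulate(service_results, max_check_attempts, host_up):
    """Toy model of the behaviour described in this thread (not real Nagios logic).

    service_results: consecutive service check results, e.g. ["CRITICAL", ...]
    host_up: assumed outcome of every on-demand host check
    Returns a list of (event, status, attempt) tuples.
    """
    events = []
    attempt = 0
    for result in service_results:
        if result == "OK":
            attempt = 0
            events.append(("service", "OK", attempt))
            continue
        attempt += 1
        events.append(("service", result, attempt))
        # Per the thread: a host check fires on the first non-OK result,
        # not only once the service has reached a HARD state.
        events.append(("host_check", "UP" if host_up else "DOWN", attempt))
        if not host_up:
            # Per the thread: the service never reaches max_check_attempts.
            break
        if attempt >= max_check_attempts:
            events.append(("service_hard", result, attempt))
            break
    return events


down = simulate(["CRITICAL"] * 15, max_check_attempts=15, host_up=False)
up = simulate(["CRITICAL"] * 15, max_check_attempts=15, host_up=True)

print("host down:", down)
print("host up, last event:", up[-1])
```

With the host down, the model stops after the very first service attempt, so a service max_check_attempts of 15 does not delay the host check. Only when the host check keeps returning UP does the service walk through all 15 attempts before going HARD.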