Re: [Nagios-users] Host retry interval

2006-05-23 Thread Kyle Tucker
   for that service.  Ergo, if you set your service checks max to 15, after
   15 minutes (assuming your delay is 60 seconds) your service will hit a
   HARD CRITICAL, and host checks will fire.
 
  That's not correct, host checks are performed as soon as a service check
  returns some non-OK status.
 
 Yes, that's right.
 A service never reaches max_check_attempts if the host check returns non-OK.

Thanks for all the insight. I assume ALL service checks are on hold as soon
as the host check returns non-OK. What I think I'm finding is that some of my
SNMP-based service check scripts hang far too long, and the Net-SNMP daemon
can only handle one connection at a time (it doesn't fork a new daemon), so
all subsequent checks fail, including my HOST check, since it combines
connecting to the snmpd and sshd daemons. The sshd connection is fine, but
the snmpd connection times out. I do it this way because pinging the systems
I'm monitoring isn't possible; they're spread across the Internet. Should I
set my service check timeout to something outrageous (it's already 90
seconds), or can I tell Nagios to run serialized service checks for certain
problematic hosts?
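The thread does not say whether Nagios 2.x can serialize checks per host. One way to get that effect outside Nagios is a wrapper plugin that takes an exclusive per-host lock before running the real SNMP check, so only one check at a time talks to a given snmpd, and that kills the check after its own timeout instead of letting it hang. The following is a minimal sketch under those assumptions; the function name, lock directory and command-line layout are hypothetical, not part of Nagios. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual plugin convention.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical directory for per-host lock files (an assumption, not a Nagios path).
LOCK_DIR = os.path.join(tempfile.gettempdir(), "nagios-host-locks")


def run_serialized(host, command, timeout):
    """Run `command` while holding an exclusive lock for `host`.

    Every wrapper invocation for the same host waits on the same lock file,
    so checks against that host run one at a time. The child is killed after
    `timeout` seconds and UNKNOWN is reported instead of hanging.
    Returns (exit_code, output).
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_path = os.path.join(LOCK_DIR, host + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until any other wrapper holding this host's lock finishes.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            result = subprocess.run(command, capture_output=True,
                                    text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the hung child at this point.
            return UNKNOWN, "UNKNOWN - check timed out after %s seconds" % timeout
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical usage: serialize_check.py HOST TIMEOUT real_plugin [args...]
    code, output = run_serialized(sys.argv[1], sys.argv[3:], float(sys.argv[2]))
    print(output)
    sys.exit(code)
```

Time spent waiting for the lock is not bounded by the wrapper, so Nagios's own service_check_timeout would still apply to the wrapper as a whole; the per-check timeout should be set well below it.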

-- 
- Kyle 
-
[EMAIL PROTECTED]   http://www.panix.com/~kylet
-


---
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnkkid=120709bid=263057dat=121642
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


Re: [Nagios-users] Host retry interval

2006-05-22 Thread Eli Stair

I'd suggest NOT taking this action at the host level, because all service
checks are halted for the duration of this non-parallelized action. You want
to avoid doing a host check until absolutely necessary. I suggest increasing
max_check_attempts on the SERVICE instead; this will avoid impacting the
monitoring system as a whole.
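As a rough guide to what raising the service's max_check_attempts buys, here is a minimal sketch of the arithmetic. It assumes the first failed check counts as attempt 1, each retry runs retry_check_interval time units after the previous one (60 seconds per unit with the default interval_length), and it ignores check execution time and scheduling latency. It only estimates when the service itself goes HARD; it does not settle when host checks run, which is the point disputed elsewhere in this thread. The function name is hypothetical.

```python
def seconds_until_hard(max_check_attempts, retry_check_interval, interval_length=60):
    """Approximate seconds from the first non-OK result to the HARD state.

    Attempt 1 is the failing check itself; the HARD state is reached on
    attempt `max_check_attempts`, i.e. after (max_check_attempts - 1) retries.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_check_interval * interval_length


# With 15 attempts and a 60-second retry delay, the service goes HARD about
# 840 seconds (14 minutes) after the first failure.
print(seconds_until_hard(15, 1))
```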

/eli


On 5/22/06 11:45 AM, Kyle Tucker [EMAIL PROTECTED] wrote:

 Hi,
 I have many hosts that constantly flap between DOWN and UP because they
 are unreachable for certain periods. To give the systems more time to
 become available, I increased max_check_attempts from 2 to 5. At 2, the
 interval between retry attempts was 10 seconds; now, at 5, it is 7
 seconds. I'd like this interval to be longer for some hosts, but there's a
 really scary note on the host check_interval option saying not to use it
 if you can help it. Is this my only option, or is there a better way? I am
 also following the thread titled "Workaround for 'Host DOWN' false-positives",
 but I don't clearly see how to set that up. Here's some output for a failed
 host (I tail the log and parse the date field in a script so it's readable).
 
 Mon May 22 14:25:52 2006  HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
 Mon May 22 14:25:59 2006  HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
 Mon May 22 14:26:06 2006  HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
 Mon May 22 14:26:16 2006  HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
 Mon May 22 14:26:23 2006  HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding
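The spacing between the retries in the log above can be measured directly. The following is a minimal sketch that parses the already-converted lines exactly as shown (date, two spaces, then `HOST ALERT: host;state;state type;attempt;output`) and reports the gap before each attempt. In the raw nagios.log the timestamp is a bracketed Unix epoch, which Kyle's script has already converted; the function names are hypothetical. Since, as noted in this thread, Nagios 2.x has no retry interval for host checks, these gaps likely reflect how long each failing host check took rather than a configured delay.

```python
from datetime import datetime

LOG = """\
Mon May 22 14:25:52 2006  HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
Mon May 22 14:25:59 2006  HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
Mon May 22 14:26:06 2006  HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
Mon May 22 14:26:16 2006  HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
Mon May 22 14:26:23 2006  HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding
"""


def parse_alert(line):
    """Split one converted HOST ALERT line into its fields."""
    stamp, _, alert = line.partition("  HOST ALERT: ")
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    host, state, state_type, attempt, output = alert.split(";", 4)
    return when, host, state, state_type, int(attempt), output


def retry_gaps(lines):
    """Return (attempt, seconds since previous attempt) for each retry."""
    alerts = [parse_alert(line) for line in lines if "HOST ALERT: " in line]
    return [(cur[4], int((cur[0] - prev[0]).total_seconds()))
            for prev, cur in zip(alerts, alerts[1:])]


for attempt, gap in retry_gaps(LOG.splitlines()):
    print("attempt %d: %d s after the previous one" % (attempt, gap))
```

For the five alerts above this yields gaps of 7, 7, 10 and 7 seconds.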
 





Re: [Nagios-users] Host retry interval

2006-05-22 Thread Kyle Tucker
Thanks Eli,

According to the docs, Nagios checks the status of a host when a service
check results in a non-OK status. Is that when the service reaches a HARD
state after all max_check_attempts are used up, or as soon as it goes to a
non-OK (SOFT) state? If the latter, which seems to be when it's kicking off
host checks, I don't see how increasing the service's max_check_attempts
will help.



-- 
- Kyle 
-
[EMAIL PROTECTED]   http://www.panix.com/~kylet
-




Re: [Nagios-users] Host retry interval

2006-05-22 Thread Holger Weiss
* Kyle Tucker [EMAIL PROTECTED] [2006-05-22 14:45]:
 I have many hosts that constantly flap between DOWN and UP because they
 are unreachable for certain periods. To give the systems more time to
 become available, I increased max_check_attempts from 2 to 5. At 2, the
 interval between retry attempts was 10 seconds; now, at 5, it is 7
 seconds. I'd like this interval to be longer for some hosts, but there's a
 really scary note on the host check_interval option saying not to use it
 if you can help it.

Using `check_interval' within a host definition would activate regularly
scheduled checks of the host as opposed to checking the host on-demand
only.  The directive you'd want is `retry_check_interval', but such a
thing doesn't exist for host checks.

 Is this my only option or is there a better way?

With Nagios 2.x there is no real solution; you can only use workarounds
such as defining host escalations to suppress the first notification.
AFAICS, this will be solved in Nagios 3.x.

 I am also following the thread titled Workaround for 'Host DOWN'
 false-positives

This might be another useful workaround; I haven't looked at it yet.

Holger

-- 
PGP fingerprint:  F1F0 9071 8084 A426 DD59  9839 59D3 F3A1 B8B5 D3DE




Re: [Nagios-users] Host retry interval

2006-05-22 Thread Joerg Linge
Am Montag 22 Mai 2006 22:44 schrieb Holger Weiss:
 * Eli Stair [EMAIL PROTECTED] [2006-05-22 12:49]:

  for that service.  Ergo, if you set your service checks max to 15, after
  15 minutes (assuming your delay is 60 seconds) your service will hit a
  HARD CRITICAL, and host checks will fire.

 That's not correct, host checks are performed as soon as a service check
 returns some non-OK status.

Yes, that's right.
A service never reaches max_check_attempts if the host check returns non-OK.

Jörg

