[Nagios-users] Nagios Hang?

2006-02-15 Thread Mike Koponick










I'm running Nagios 2.0 (stable) on Red Hat 9.0 in a distributed
environment. I'm using NSCA for checks, and all appears to be working
properly.

I'm running into several issues that seem to have started all of a
sudden.

1) On my distributed server, I no longer see syslog messages, with the
exception of INITIAL SERVICE STATE messages. Syslog is working, and
use_syslog=1 is set in the nagios.cfg file. I used to see all the check
messages, etc. Nothing in the configuration has changed, to the best of
my knowledge.

2) Nagios appears to hang on the remote sensor. Once I receive
notifications that network devices are down, I never see a recovery for
them, even though they have recovered. The workaround is to restart
Nagios with "service nagios restart". Sometimes this takes multiple
tries.

3) When I have a massive network outage, I receive the appropriate
alerts, but I receive multiple PROBLEM notifications. I'm only using
service checks (only check_ping currently) with notification_interval
set to 0, which according to the documentation should limit the number
of messages I receive to one, unless I'm using service escalations,
which I am not at this time. I am not receiving multiple notifications
for OK messages, which is what I would expect.

Sorry about the novel, but these have frustrated me into drinking lots
of beer.

Mike










RE: [Nagios-users] Nagios Hang?

2006-02-15 Thread Mike Koponick








I thought of something else, something that HAS changed: I'm now using
NSCA across a firewall. Could this be the cause of problem #2?



Thanks!



Mike













RE: [Nagios-users] Nagios Hang?

2006-02-15 Thread Marc Powell


> -----Original Message-----
> From: Mike Koponick [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, February 15, 2006 10:33 AM
> To: Marc Powell; Nagios Users
> Subject: RE: [Nagios-users] Nagios Hang?

[chop]

> Here are a couple of samples of my hosts/services from the sensor:

[chop]

> define service {
>         hostgroup_name          Company_Switches
>         service_description     check_ping
>         is_volatile             1
>         check_command           check_ping!150.0,20%!200.0,60%

[chop]

> Hosts/Services from the Central Server:

[chop]

> define service {
>         hostgroup_name          Company_Switches
>         service_description     check_ping
>         is_volatile             1
>         check_command           check_stale

[chop]

You normally do not want to use 'is_volatile' for passive service
checks. It should only be used for services that automatically reset
themselves to an OK state after being checked, and for which you wish to
be notified _every_ time they're checked and not OK. This would account
for your multiple notifications.

 http://nagios.sourceforge.net/docs/2_0/volatileservices.html

Volatile services differ from normal services in three important ways.
Each time they are checked when they are in a hard non-OK state, and the
check returns a non-OK state (i.e. no state change has occurred)...

* the non-OK service state is logged
* contacts are notified about the problem (if that's what should be
done)
* the event handler for the service is run (if one has been defined)


These events normally only occur for services when they are in a non-OK
state and a hard state change has just occurred. In other words, they
only happen the first time that a service goes into a non-OK state. If
future checks of the service result in the same non-OK state, no hard
state change occurs and none of the events mentioned take place again.  
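
To illustrate why is_volatile would produce repeated PROBLEM
notifications but only a single recovery, here is a minimal Python sketch
of the notification rule as the quoted documentation describes it. It is
not Nagios code. The function name and state strings are made up for
illustration, and it assumes that every result is already a hard state
(as with max_check_attempts set to 1), that the service starts out OK,
and that notification_interval is 0.

```python
def notifications(results, is_volatile):
    """Return the notifications sent for a sequence of hard check results.

    Simplified model (assumptions, not Nagios internals): every result is
    a hard state, the service starts out OK, and notification_interval
    is 0, so there is no timed re-notification.
    """
    sent = []
    previous = "OK"
    for state in results:
        changed = state != previous
        if state == "OK":
            # A recovery is only announced when leaving a non-OK state.
            if changed:
                sent.append("RECOVERY")
        elif changed or is_volatile:
            # Normal services notify on the hard state change only;
            # volatile services notify on every non-OK result.
            sent.append("PROBLEM")
        previous = state
    return sent


outage = ["CRITICAL", "CRITICAL", "CRITICAL", "OK", "OK"]
print(notifications(outage, is_volatile=False))
print(notifications(outage, is_volatile=True))
```

In this model the non-volatile case yields one PROBLEM and one RECOVERY,
while the volatile case yields a PROBLEM for each non-OK result and still
only one RECOVERY, which is the pattern reported in this thread.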

--
Marc


---
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnkkid3432bid#0486dat1642
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting 
any issue.
::: Messages without supporting info will risk being sent to /dev/null


RE: [Nagios-users] Nagios Hang?

2006-02-15 Thread Mike Koponick
Marc,

Thanks for the is_volatile tip. 

As for the NSCA, I found further information:

Feb 15 08:01:26 cmi-console nagios: Warning: OCSP command
'/usr/local/nagios/libexec/submit_check_result PHM.2950.110.5
'check_ping' 'OK' 'PING OK - Packet loss = 0%, RTA = 6.81 ms' ''' for
service 'check_ping' on host 'PHM.2950.110.5' timed out after 5 seconds

Feb 15 08:10:09 cmi-console nagios: Warning: OCSP command
'/usr/local/nagios/libexec/submit_check_result 2950-135 'check_ping'
'OK' 'PING OK - Packet loss = 0%, RTA = 3.43 ms' ''' for service
'check_ping' on host '2950-135' timed out after 5 seconds

Feb 15 08:12:36 cmi-console nagios: Warning: OCSP command
'/usr/local/nagios/libexec/submit_check_result 2950-34 'check_ping' 'OK'
'PING OK - Packet loss = 0%, RTA = 1.50 ms' ''' for service 'check_ping'
on host '2950-34' timed out after 5 seconds


It looks as though send_nsca is timing out. I don't see the connection
arriving on the remote side.

I wonder if there is some type of weird NAT issue going on between the
two firewalls, although I do see other NSCA traffic working just fine
from the same server.
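
One way to test the firewall/NAT theory from the sensor is to check
whether a TCP connection to the NSCA daemon opens at all within the same
5 seconds after which the OCSP command is reported as timed out. The
Python sketch below does only that. It assumes the central server
listens on the default NSCA port, 5667, and the function name is
hypothetical. It does not speak the NSCA protocol, so a True result
shows only that the port is reachable, not that send_nsca will succeed.

```python
import socket


def nsca_port_reachable(host, port=5667, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    port=5667 assumes the NSCA default; timeout=5.0 mirrors the 5-second
    limit shown in the OCSP warnings.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling nsca_port_reachable() with the central server's address several
times in a row from the sensor would show whether connections fail
intermittently, which would be consistent with only some submissions
timing out.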

Thanks!

Mike



