[Nagios-users] Nagios Hang?
I'm running Nagios 2.0 (stable) on Red Hat 9.0 in a distributed environment. I'm using NSCA for checks, and everything appears to be working properly. However, I'm running into several issues that seem to have started all of a sudden.

1) On my distributed server, I no longer see syslog messages, with the exception of INITIAL SERVICE STATE messages. Syslog is working, and nagios.cfg contains use_syslog=1. I used to see all the check messages, etc. Nothing in the configuration has changed to the best of my knowledge.

2) Nagios appears to hang on the remote sensor. Once I receive notifications that network devices are down, I never see a recovery for them, even though they have recovered. The workaround is to restart Nagios with "service nagios restart". Sometimes this takes multiple tries.

3) When I have a massive network outage, I receive the appropriate alerts, but I receive multiple PROBLEM notifications. I'm only using service checks (currently only check_ping) with notification_interval set to 0, which according to the documentation should limit the number of messages I receive to one, unless I'm using service escalations, which I am not at this time. I am not receiving multiple notifications for OK messages, which is what I would expect.

Sorry about the novel, but these have frustrated me into drinking lots of beer.

Mike
RE: [Nagios-users] Nagios Hang?
I thought of something else, something that HAS changed: I'm now using NSCA across a firewall. Could this be the cause of #2?

Thanks!

Mike

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Koponick
Sent: Wednesday, February 15, 2006 8:10 AM
To: Nagios Users
Subject: [Nagios-users] Nagios Hang?

[chop]
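One way to narrow down the firewall question is to test, from the sensor, whether a TCP connection to the central server's NSCA port can be opened at all. The following is a minimal sketch, not part of the original thread: the function name can_reach_nsca is hypothetical, and 5667 is assumed because it is NSCA's default port (the port actually configured may differ).

```python
import socket

# Assumption: the NSCA daemon listens on its default port, 5667.
NSCA_DEFAULT_PORT = 5667


def can_reach_nsca(host, port=NSCA_DEFAULT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when the connection cannot be made.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling can_reach_nsca with the central server's address from the sensor gives a quick yes/no answer. A successful connect only shows that the firewall passes the port; it says nothing about NSCA's password or encryption settings, and an intermittent NAT problem may need repeated probes to show up.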
RE: [Nagios-users] Nagios Hang?
-----Original Message-----
From: Mike Koponick [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 15, 2006 10:33 AM
To: Marc Powell; Nagios Users
Subject: RE: [Nagios-users] Nagios Hang?

[chop]

Here are a couple of samples of my hosts/services from the sensor:

[chop]

define service {
    hostgroup_name          Company_Switches
    service_description     check_ping
    is_volatile             1
    check_command           check_ping!150.0,20%!200.0,60%

[chop]

Hosts/Services from the Central Server:

[chop]

define service {
    hostgroup_name          Company_Switches
    service_description     check_ping
    is_volatile             1
    check_command           check_stale

[chop]

You normally do not want to use 'is_volatile' for passive service checks. They should only be used for services that automatically reset themselves to an OK state after being checked and you wish to be notified _every_ time they're checked and not OK. This would account for your multiple notifications.

http://nagios.sourceforge.net/docs/2_0/volatileservices.html

Volatile services differ from normal services in three important ways. Each time they are checked when they are in a hard non-OK state, and the check returns a non-OK state (i.e. no state change has occurred)...

* the non-OK service state is logged
* contacts are notified about the problem (if that's what should be done)
* the event handler for the service is run (if one has been defined)

These events normally only occur for services when they are in a non-OK state and a hard state change has just occurred. In other words, they only happen the first time that a service goes into a non-OK state. If future checks of the service result in the same non-OK state, no hard state change occurs and none of the events mentioned take place again.

--
Marc

---
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnkkid3432bid#0486dat1642 ___ Nagios-users mailing list Nagios-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nagios-users ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. ::: Messages without supporting info will risk being sent to /dev/null
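The difference Marc describes between volatile and normal services can be illustrated with a small simulation. This is a simplified sketch based only on the documentation excerpt quoted above, not Nagios's actual implementation: it ignores soft states, retries, recoveries, and notification_interval, and the function names are hypothetical.

```python
# Numeric states follow the usual plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def problem_events_fire(previous_hard_state, current_hard_state, is_volatile):
    """Return True if a non-OK hard result would be logged, notified, and handled."""
    if current_hard_state == OK:
        return False  # recoveries are outside the scope of this sketch
    hard_state_changed = current_hard_state != previous_hard_state
    # Normal services fire only on a hard state change;
    # volatile services fire on every non-OK hard result.
    return hard_state_changed or is_volatile


def count_problem_events(hard_results, is_volatile, initial_state=OK):
    """Count how many results in the sequence would trigger the problem events."""
    previous = initial_state
    fired = 0
    for current in hard_results:
        if problem_events_fire(previous, current, is_volatile):
            fired += 1
        previous = current
    return fired
```

For three consecutive CRITICAL results followed by an OK, count_problem_events([CRITICAL, CRITICAL, CRITICAL, OK], is_volatile=False) returns 1, while the same sequence with is_volatile=True returns 3, which matches the repeated PROBLEM notifications reported in this thread.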
RE: [Nagios-users] Nagios Hang?
Marc,

Thanks for the is_volatile tip. As for NSCA, I found further information:

Feb 15 08:01:26 cmi-console nagios: Warning: OCSP command '/usr/local/nagios/libexec/submit_check_result PHM.2950.110.5 'check_ping' 'OK' 'PING OK - Packet loss = 0%, RTA = 6.81 ms' ''' for service 'check_ping' on host 'PHM.2950.110.5' timed out after 5 seconds
Feb 15 08:10:09 cmi-console nagios: Warning: OCSP command '/usr/local/nagios/libexec/submit_check_result 2950-135 'check_ping' 'OK' 'PING OK - Packet loss = 0%, RTA = 3.43 ms' ''' for service 'check_ping' on host '2950-135' timed out after 5 seconds
Feb 15 08:12:36 cmi-console nagios: Warning: OCSP command '/usr/local/nagios/libexec/submit_check_result 2950-34 'check_ping' 'OK' 'PING OK - Packet loss = 0%, RTA = 1.50 ms' ''' for service 'check_ping' on host '2950-34' timed out after 5 seconds

It looks like send_nsca is timing out. I don't see the connection arriving on the remote side. I wonder if there is some kind of weird NAT issue between the two firewalls, although I do see other NSCA traffic working just fine from the same server.

Thanks!

Mike

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Marc Powell
Sent: Wednesday, February 15, 2006 8:51 AM
To: Nagios Users
Subject: RE: [Nagios-users] Nagios Hang?

[chop]
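To see whether the OCSP timeouts are concentrated on particular hosts or spread evenly, the warnings can be tallied from the syslog output. The sketch below assumes the message layout shown in the three log lines quoted in this thread; the function name ocsp_timeouts_by_host is hypothetical. The 5-second limit in those warnings appears to match the ocsp_timeout setting in nagios.cfg, though that is an inference rather than something confirmed in the thread.

```python
import re
from collections import Counter

# Pattern derived from the warning lines quoted in this thread.
# The command itself contains nested single quotes, so it is skipped
# with a greedy match up to the final "' for service '" marker.
OCSP_TIMEOUT = re.compile(
    r"OCSP command '.*' for service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' timed out after (?P<seconds>\d+) seconds"
)


def ocsp_timeouts_by_host(lines):
    """Count OCSP timeout warnings per host in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = OCSP_TIMEOUT.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Passing an open log file (for example the system's messages file, whose path varies by distribution) to ocsp_timeouts_by_host yields a Counter keyed by host name. If only a few hosts dominate, the problem is more likely host-specific than a general firewall or NAT issue.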