[Nagios-users] Nagios Sending Notifications to Contacts not in Escalation Config (Bug?)
Hi guys!

I am now officially baffled by how Nagios handles service escalations and notifications. I'm using Nagios 3.2.3 on SLES 10 SP3, and my current setup is this:

service_escalation.cfg:

    define serviceescalation {
        service_description    http_80
        host_name              apache02
        first_notification     1
        last_notification      5
        notification_interval  60
        escalation_period      Office_Hours
        contact_groups         unix-sms, dba-email, dev-email
    }

    define serviceescalation {
        service_description    http_80
        host_name              apache02
        first_notification     6
        last_notification      8
        notification_interval  90
        escalation_period      Office_Hours
        contact_groups         unix-sms, dba-email, dev-email, unix-supervisor, dev-supervisor
    }

    define serviceescalation {
        service_description    http_80
        host_name              apache02
        first_notification     1
        last_notification      0
        notification_interval  60
        escalation_period      24x7
        contact_groups         unix-admins-email
    }

The users referenced in service_escalation.cfg are configured in contacts.cfg like this:

    define contact{
        contact_name                   unix-sms
        alias                          Team UNIX
        host_notification_period       Early_Morning
        service_notification_period    Early_Morning
        host_notification_options      u,d,r
        service_notification_options   w,c,u,r
        host_notification_commands     host-notify-by-epager
        service_notification_commands  notify-by-epager
        email                          u...@email.org
    }

    define contact{
        contact_name                   unix-supervisor
        alias                          Team UNIX Supervisor
        host_notification_period       Early_Morning
        service_notification_period    Early_Morning
        host_notification_options      u,d,r
        service_notification_options   w,c,u,r
        host_notification_commands     host-notify-by-epager
        service_notification_commands  notify-by-epager
        email                          unixsupervi...@email.org
    }

timeperiod.cfg looks like this:

    define timeperiod{
        timeperiod_name  Office_Hours
        alias            Office_Hours
        sunday           09:00-20:00
        monday           09:00-20:00
        tuesday          09:00-20:00
        wednesday        09:00-20:00
        thursday         09:00-20:00
        friday           09:00-20:00
        saturday         09:00-20:00
    }

    define timeperiod{
        timeperiod_name  Early_Morning
        alias            Early_Morning
        sunday           07:00-22:10
        monday           07:00-22:10
        tuesday          07:00-22:10
        wednesday        07:00-22:10
        thursday         07:00-22:10
        friday           07:00-22:10
        saturday         07:00-22:10
    }

With these configurations in place, the http_80 service goes down at 10pm every night (scheduled downtime). I expected notifications from 10pm onwards to go *only* to unix-admins-email because of the service_escalation.cfg file. And they did, at least for the critical notifications.

Now the fun part. The recovery notification was sent to the unix-sms, dba-email, dev-email, unix-supervisor and dev-supervisor groups at 7:03am, when the service returned to OK status. That is weird, because the critical notifications from 10pm to 6am (next day) were sent only to the unix-admins-email group. Plus, I read in the Nagios docs that it will not send recovery notifications to those who did not receive the critical/warning/unknown notifications in the first place.

So my questions are:

Why did Nagios send the recovery alert to the supervisors, who did not know that the service was down in the first place because they did not receive the critical alert?

Did Nagios take their defined timeperiods into consideration when it sent the recovery alert?

TIA!

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
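As a quick sanity check on the time windows in the post above, here is a minimal Python sketch that tests the two clock times mentioned (10pm and 7:03am) against the Office_Hours and Early_Morning definitions. It only does the window arithmetic; it does not model how Nagios selects escalations or contacts, and the helper name in_period is hypothetical.

```python
from datetime import time

# Daily windows copied from the timeperiod definitions in the post
# (the same range applies to every weekday there).
TIMEPERIODS = {
    "Office_Hours": (time(9, 0), time(20, 0)),
    "Early_Morning": (time(7, 0), time(22, 10)),
}


def in_period(name: str, t: time) -> bool:
    """Return True if clock time t falls inside the named daily window."""
    start, end = TIMEPERIODS[name]
    return start <= t < end


if __name__ == "__main__":
    events = {"critical at 22:00": time(22, 0), "recovery at 07:03": time(7, 3)}
    for label, t in events.items():
        for name in TIMEPERIODS:
            print(f"{label}: inside {name}? {in_period(name, t)}")
```

Both times fall inside Early_Morning (the contacts' notification period) and outside Office_Hours (the escalation period of the first two escalations), so the contacts' own timeperiods would not by themselves have blocked the 7:03am recovery.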
Re: [Nagios-users] What the best way to monitor Windows?
> I've got a Nagios installation all up and running and working just fine. Now I have to start monitoring some Windows servers. There are so many different plugins. What are the recommendations for doing this? We have some Windows Server 2003 and 2008 machines, as well as Exchange and SQL.
>
> Thanks

NSClient++ worked fine for us.
Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3
I finally figured this one out.

The plugin was producing the out-of-bounds error because of an underlying performance problem on the server. When the server becomes slow to respond, Nagios reports this error. The hint provided by Sven confirmed that disk I/O was the culprit, possibly because the plugin was terminated before it could finish what it was doing. Replacing the disk with a faster one solved the issue.

Case closed. Thanks to all who helped!

Rai
[Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3
Hi everyone,

I just installed a fresh Nagios v3.2.3 with about 150 hosts and 600 services. I noticed that from time to time hosts show a "Return code of 141 is out of bounds" status, which eventually goes away on its own.

I don't know if this has anything to do with the plugin, since the status returns to OK without intervention, which suggests that the check_icmp plugin itself works fine. I'm confused by this error, and it never appeared when we were using Nagios v2.

Does anyone have the same issue?

Big thanks,
Rai
Re: [Nagios-users] Nagios Issue of not detecting down/up status of server and delay of mails notifications
Have you checked your time periods?

On Fri, Jun 17, 2011 at 10:42 PM, Manish Kumar manikuma...@gmail.com wrote:

> Hi Friends,
>
> I have implemented Nagios Core 3.2.3 on Fedora Core 14 and configured it to monitor around 230 network devices for different services such as up/down status, uptime and port link status. It is similarly configured to monitor around 30 Windows servers for different services.
>
> The problem is that Nagios does not send e-mail notifications as soon as a server, service or network device goes down. Some of the e-mails are delayed by around 12 hours and some are never triggered at all. For example, if a server goes down for around an hour and comes back up after that, Nagios does not detect it.
>
> I have set up Fedora 14 to use its sendmail server to relay the mails to our corporate SMTP server.
>
> Is this a known issue with Nagios, or is there a way to scale it up? How can a network/server admin rely on it if it works like this?
>
> --
> Thanks
> Manish Kumar
> http://in.linkedin.com/in/manishkumar85
> http://cens.cdac.in/
Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3
The plugin returns OK status when run manually. The error seems to occur at random times but, as mentioned, eventually goes away. If the plugin were the issue, the error should be persistent. In my case, it happens only from time to time. I have only seen this since we moved to Nagios 3.2.3; it never happened in Nagios v2.6.

On Mon, Jun 20, 2011 at 10:16 AM, Yueh-Hung Liu yuehung@gmail.com wrote:

> nagios only accepts integers 0~3 as return codes of plugins. try to manually execute the command of the questioned service (as the user nagios runs as) and check the output.
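To illustrate the 0~3 rule mentioned above, here is a minimal Python sketch that classifies a plugin return code. The mapping of 0 to 3 onto OK/WARNING/CRITICAL/UNKNOWN is the standard plugin convention. The decoding of values above 128 as "128 + signal number" is a common shell convention and is only an assumption about where a code like 141 might come from; nothing in this thread confirms it for this case. The function name describe_return_code is hypothetical.

```python
import signal

# Standard plugin return codes; anything else is reported as out of bounds.
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_return_code(code: int) -> str:
    """Map a plugin return code to a service state or an out-of-bounds note."""
    if code in PLUGIN_STATES:
        return PLUGIN_STATES[code]
    if code > 128:
        # Assumption: shells report "killed by signal N" as exit status 128 + N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"out of bounds (possibly killed by signal {signum}, {name})"
    return "out of bounds"


if __name__ == "__main__":
    for code in (0, 2, 4, 141):
        print(code, "->", describe_return_code(code))
```

Under that assumption, 141 would correspond to signal 13, which is SIGPIPE on Linux. That would fit a plugin being cut off before it finished, but it is a guess rather than a diagnosis.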
Re: [Nagios-users] Return code of 141 is out of bounds Error in Nagios 3.2.3
Hi Allan,

I'm just saying that this could be a bug in the current Nagios, or in the plugin as Terry pointed out, since it was never present in the previous version that I was using. This really freaks me out.

I have enabled Nagios's debug mode in the hope of learning more about this. I am also considering recompiling the whole thing. There is a nasty little bugger out there.

Thanks!
[Nagios-users] current_state value in status.dat file
I am looking at the status.dat file and I noticed that current_state is 1 even when the host is down and in a critical state. I assumed that if a host is down, current_state should be 2. I am not sure how Nagios sets current_state, but in my case the host is definitely off:

    admin1@serverr1:~ ping userver02
    PING userver02.localhost (192.168.0.11) 56(84) bytes of data.

    --- userver02.localhost ping statistics ---
    7 packets transmitted, 0 received, 100% packet loss, time 6012ms

So why does the status.dat file say that current_state=1? Does it mean that the host is only in a Warning state and not Critical, contrary to what I am seeing in the web console?

I am bothered by this because I am using the check_summary.pl plugin (http://exchange.nagios.org/directory/Plugins/Network-and-Systems-Management/Nagios/check_summary/details), and it reads the status.dat file.

Btw, the status.dat entry looks like this:

    hoststatus {
        host_name=userver02
        modified_attributes=0
        check_command=check-host-alive
        check_period=24x7
        notification_period=24x7
        check_interval=5.00
        retry_interval=1.00
        event_handler=
        has_been_checked=1
        should_be_scheduled=1
        check_execution_time=2.114
        check_latency=0.239
        check_type=0
        current_state=1
        last_hard_state=1
        last_event_id=505
        current_event_id=509
        current_problem_id=232
        last_problem_id=0
        plugin_output=CRITICAL - userver02.localhost: rta nan, lost 100%
        long_plugin_output=
        performance_data=rta=0.000ms;800.000;1000.000;0; pl=100%;80;100;; rtmax=0.000ms rtmin=0.000ms
        last_check=1300344149
        next_check=1300344459
        check_options=0
        current_attempt=1
        max_attempts=2
        current_event_id=509
        last_event_id=505
        state_type=1
        last_state_change=1299383163
        last_hard_state_change=1299383163
        last_time_up=0
        last_time_down=1300344159
        last_time_unreachable=0
        last_notification=1300344079
        next_notification=1300347679
        no_more_notifications=0
        current_notification_number=267
        current_notification_id=2857
        notifications_enabled=1
        problem_has_been_acknowledged=0
        acknowledgement_type=0
        active_checks_enabled=1
        passive_checks_enabled=1
        event_handler_enabled=1
        flap_detection_enabled=1
        failure_prediction_enabled=1
        process_performance_data=1
        obsess_over_host=1
        last_update=1300344223
        is_flapping=0
        percent_state_change=0.00
        scheduled_downtime_depth=0
    }
Re: [Nagios-users] current_state value in status.dat file
Thanks for pointing me to the doc. I've been going through it, and I guess I was looking for the part where it says that DOWN = 1 and not 2.

On Thu, Mar 17, 2011 at 3:40 PM, Giacomo Montagner gmontag...@sorint.it wrote:

> On Thu, 17 Mar 2011 14:57:08 +0800 Rai Ricafrente maill...@ricafrente.com wrote:
>
> > I am looking at the status.dat file and I noticed that the current_state is =1 even when the host is down and is in critical state. I assume that if a host is down, the current_state should be =2.
>
> It's probably because hosts can be in only 3 states:
>
> UP=0
> DOWN=1
> UNREACHABLE=2
>
> (http://nagios.sourceforge.net/docs/3_0/networkreachability.html)
>
> Giacomo
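For anyone reading status.dat from their own scripts, the host state mapping above can be applied with a small parser. This is a minimal Python sketch, assuming the file stores one key=value pair per line inside "blocktype {" ... "}" sections; the function names parse_status and host_state_name and the shortened sample text are hypothetical and are not taken from check_summary.pl.

```python
# Host state numbers as listed in the reply above.
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}

# Shortened sample based on the hoststatus entry posted in this thread.
SAMPLE = """
hoststatus {
    host_name=userver02
    current_state=1
    plugin_output=CRITICAL - userver02.localhost: rta nan, lost 100%
    performance_data=rta=0.000ms;800.000;1000.000;0; pl=100%;80;100;;
    }
"""


def parse_status(text):
    """Return a list of (block_type, fields) tuples parsed from status.dat text."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("{"):
            current = (line[:-1].strip(), {})
        elif line == "}":
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only; values may contain "=" themselves.
            key, value = line.split("=", 1)
            current[1][key] = value
    return blocks


def host_state_name(fields):
    """Translate a hoststatus block's current_state into UP/DOWN/UNREACHABLE."""
    return HOST_STATES.get(int(fields["current_state"]), "INVALID")


if __name__ == "__main__":
    for block_type, fields in parse_status(SAMPLE):
        if block_type == "hoststatus":
            print(fields["host_name"], host_state_name(fields))
```

With the sample above, current_state=1 is reported as DOWN, which matches what the web console shows for a host that is off.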