Re: [Nagios-users] host-down notification can take 50 mins to be sent
I suggest you wait, since the 3.x alpha is out now. I've had worse problems with that one, though, so I reverted to the latest stable release. I've also been running 2.0a for another site and have never had this problem with that version. So I'd be interested to see your results with 2.9 or 3.x.

Thanks for all your help.

On 6/15/07, Jim Avery <[EMAIL PROTECTED]> wrote:

On 15/06/07, stucky <[EMAIL PROTECTED]> wrote:
> Jim
>
> I'm confused
>
> 1. Nagios 2.9 comes with flapping turned off globally by default :
>
> # Values: 1 = enable flap detection
> #         0 = disable flap detection (default)
>
> enable_flap_detection=0
>
> 2. It also comes with check_for_orphaned_services=1
>
> 3. Most importantly it comes with a localhost.cfg file that has 2 nested
> host templates from the start.
>
> One called 'generic-host' and one called 'linux-server' which uses
> 'generic-host'
> Then it has a host description that uses 'linux-host' so we have 3 levels of
> recursion right from the start.
> It does the same thing with service templates.
>
> I assume you must not have looked at the defaults at all or just changed it
> back to just one template.
> I never used more than one before either but since the default configs
> suggest it I figured it'd be ok.

I confess I don't use Nagios 2.9 yet. Maybe it's time I should! I'm hoping to buy a new server for Nagios soon, as the old one is creaking under the strain of all the active checks and rrd databases. When that arrives I'll drag myself back up to the cutting edge. I haven't looked at the defaults for 18 months or so.

cheers,

Jim

--
stucky

___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
[Nagios-users] Fwd: host-down notification can take 50 mins to be sent
---------- Forwarded message ----------
From: stucky <[EMAIL PROTECTED]>
Date: Jun 15, 2007 11:17 AM
Subject: Re: [Nagios-users] host-down notification can take 50 mins to be sent
To: Jim Avery <[EMAIL PROTECTED]>

Jim

I'm confused

1. Nagios 2.9 comes with flapping turned off globally by default :

# Values: 1 = enable flap detection
#         0 = disable flap detection (default)

enable_flap_detection=0

2. It also comes with check_for_orphaned_services=1

3. Most importantly it comes with a localhost.cfg file that has 2 nested host templates from the start.

One called 'generic-host' and one called 'linux-server' which uses 'generic-host'
Then it has a host description that uses 'linux-host' so we have 3 levels of recursion right from the start.
It does the same thing with service templates.

I assume you must not have looked at the defaults at all or just changed it back to just one template.
I never used more than one before either but since the default configs suggest it I figured it'd be ok.

Does anyone else here use object inheritance with multi-level recursion?

On 6/15/07, Jim Avery <[EMAIL PROTECTED]> wrote:

On 15/06/07, stucky <[EMAIL PROTECTED]> wrote:
> Actually, flapping is turned off globally in nagios.cfg and although it's
> still turned on in the host template it shouldn't matter right ?

No, my understanding is that the global setting overrides everything else (I've never turned it off globally myself though).

> I turned it off here as well.

Might as well.

> 1. How can a host be flapping if it's down ?

Flapping can be detected on any change of state down-up or up-down. How it works is documented here:
http://nagios.sourceforge.net/docs/2_0/toc.html (look under advanced topics).

Anyway, from what you're saying, it sounds like flapping isn't your problem.

> 2. Are most of you guys on here using flap detection and has it been causing
> this kind of problem for anyone ?

Well I do, obviously. I guess lots of others do as it is important in preventing getting storms of notifications if a host or service is flapping. If you don't have the f notification option, it can sometimes be confusing to see the notification that a service has gone down, but not get one to show it's come up again. One problem with having flap notifications is that Nagios might detect that the host or service has stopped flapping at any odd hour - if your notifications are going to an on-call pager for example you can end up waking up your on-call engineer unnecessarily. As ever it's up to you to decide what your priorities are.

> 3. If it was flapping shouldn't I have seen this in the log?

Yes. You'll see it in the alert history for that host (or service).

Another thing to try when you get this kind of behaviour is to set check_for_orphaned_services to 1 in the main nagios config file.

I've never tried having a template use another template. I'm not saying it shouldn't work, it's just not something I would feel comfortable doing. I'd be interested to hear a definitive answer as to whether that's a good thing to do or not myself as it would come in handy sometimes.

cheers,

Jim

--
stucky
[Nagios-users] host-down notification can take 50 mins to be sent
Guys

I'm trying the latest stable 2.x version (2.9), and on top of the 2 existing default host templates I added a 3rd one, since the documentation states that there is no limit. I added a host and started monitoring. When I took it down, it took between 2 and 5 minutes for the host-down notification to come in. However, later on I rebooted again and this time nothing came in. The nagios log showed nothing about wanting to send a notification either. The box came back without any notification. I took it down again later and waited; after 50 minutes I got a host-down notification. When I brought the host back I almost immediately got a host-up notification.

I removed one of the templates to change the recursion level of the host templates from 3 to 2 and tried again. I did 3 tests and all came back fine this time. I always got the notification within 5 minutes max. Then I added the 3rd template back again to see whether it had to do with that, but now I can't reproduce this. I did 2 tests and both were fine.

I don't feel that I can trust nagios now though. I've been using it for a few years now, since version 1.2, and I've never seen this behaviour before. However, I've also never used more than 1 host/service template. This time I wanted to make more use of the object inheritance logic to shorten my cfg, but somehow I feel it causes problems. How deep is the template recursion for most of you folks?
Here are the templates I was using when the 50 min delay happened.

Hosts:

# Host templates
define host{
        name                            generic-host
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_period             24x7
        register                        0
        }

define host{
        name                            generic-linux
        use                             generic-host
        check_period                    24x7
        max_check_attempts              10
        check_command                   check-host-alive
        notification_interval           120
        notification_options            d,u,r
        register                        0
        }

define host{
        name                            prod
        use                             generic-linux
        contact_groups                  sysadmins,psst
        register                        0
        }

define host{
        name                            nonprod
        use                             generic-linux
        contact_groups                  sysadmins
        register                        0
        }

Then I use either the prod or nonprod template for all my hosts. Same with services:

# Service templates
define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        failure_prediction_enabled      1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        is_volatile                     0
        register                        0
        }

define service{
        name                            generic-checks
        use                             generic-service
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           5
        retry_check_interval            1
        notification_options            w,u,c,r
        notification_interval           60
        notification_period             24x7
        register                        0
        }

define service{
        name                            prod
        use                             generic-checks
        contact_groups                  sysadmins,psst
        register                        0
        }

define service{
        name                            nonprod
        use                             generic-checks
        contact_groups                  sysadmins
        register                        0
        }

Here I also use prod or nonprod as templates for my services.

I'm going to test this more tomorrow, but I'm worried that if a host goes down I might not get notified until 50 mins later, or maybe never, who knows? It doesn't seem to behave the same way every time, but as far as I can see the service checks run every 5 minutes, so within that time frame I should get a notification. Parallel checks are turned on as well.
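To make the inheritance chain above easier to reason about, here is a minimal Python sketch that flattens a `use` chain into the directives a host would end up with. It is not Nagios code: the `resolve` helper and the dictionary layout are assumptions made for illustration, it handles only a single parent per template, and it leaves out `name` and `register`, which are not inherited. Only a subset of the directives from the post is copied in.

```python
def resolve(name, templates, _seen=()):
    """Return the effective directives for `name`, following its `use` chain."""
    if name in _seen:
        raise ValueError("template loop: " + " -> ".join(_seen + (name,)))
    own = dict(templates[name])
    parent = own.pop("use", None)
    effective = resolve(parent, templates, _seen + (name,)) if parent else {}
    effective.update(own)  # directives set locally override inherited ones
    return effective


# Subset of the host templates quoted in the post (illustrative layout).
HOST_TEMPLATES = {
    "generic-host": {
        "notifications_enabled": "1",
        "flap_detection_enabled": "1",
        "notification_period": "24x7",
    },
    "generic-linux": {
        "use": "generic-host",
        "check_period": "24x7",
        "max_check_attempts": "10",
        "check_command": "check-host-alive",
        "notification_interval": "120",
        "notification_options": "d,u,r",
    },
    "prod": {"use": "generic-linux", "contact_groups": "sysadmins,psst"},
    "nonprod": {"use": "generic-linux", "contact_groups": "sysadmins"},
}

for key, value in sorted(resolve("prod", HOST_TEMPLATES).items()):
    print(f"{key:28} {value}")
```

If the flattened result matches what a single hand-written template would contain, the depth of the chain by itself should not change the effective settings, which fits the observation that the delay could not be reproduced after the third template was added back.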
[Nagios-users] host down notification but no host up notification ?
ion: Wed Dec 31 16:00:00 1969
[1181696031.131095:032.1] We shouldn't notify about this recovery.
[1181696031.131102:032.0] Notification viability test failed. No notification will be sent out.
[1181696111.052735:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'CFENVD', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696111.052759:032.1] We shouldn't notify about this recovery.
[1181696111.052766:032.0] Notification viability test failed. No notification will be sent out.
[1181696111.052971:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'PERC CONTROLLER', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696111.052984:032.1] We shouldn't notify about this recovery.
[1181696111.052992:032.0] Notification viability test failed. No notification will be sent out.
[1181696111.053334:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'CFEXECD', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696111.053348:032.1] We shouldn't notify about this recovery.
[1181696111.053355:032.0] Notification viability test failed. No notification will be sent out.
[1181696121.163710:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'MEM', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696121.163738:032.1] We shouldn't notify about this recovery.
[1181696121.163746:032.0] Notification viability test failed. No notification will be sent out.
[1181696121.163984:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'DISK USAGE /var', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696121.163998:032.1] We shouldn't notify about this recovery.
[1181696121.164005:032.0] Notification viability test failed. No notification will be sent out.
[1181696141.130999:032.0] ** Service Notification Attempt ** Host: 'lithium', Service: 'DISK USAGE /', Type: 0, Current State: 0, Last Notification: Wed Dec 31 16:00:00 1969
[1181696141.131023:032.1] We shouldn't notify about this recovery.
[1181696141.131031:032.0] Notification viability test failed. No notification will be sent out.

Clearly, nagios decided that I shouldn't get a host up notification. I just don't understand why. From the log files I'd say the following logic takes place :

1. Host goes down - service check fails
2. Nagios checks to see if host is down - YES
3. Because of step 2. no service notifications are sent
4. Host down notification is sent instead
5. Host comes back
6. Service checks start recovering - no service recovery notification is sent since no service problem notifications were sent in the first place.
7. Host is assumed to be up since service is up
8. Hence - no host up notification.

First I thought my host up notification might not make it through one of the notification filters but according to the log there is NO HOST check after the reboot therefore there is no host notification attempt. Looks to me like a design bug but I wanna make sure I'm not getting this wrong. It just doesn't make sense to me that I wouldn't be notified about a host coming back. I understand the part about the services.

INTERESTING: I have rebooted a few times and it appears that sometimes I do get host up notifications but most of the time I don't so it seems to have to do with when exactly the reboot occurs. Also, I turned off flapping globally but no difference.

Anyone seen this behaviour ?

--
stucky
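A small Python sketch may help when reading the debug log above. It assumes the bracketed prefix on each entry starts with Unix epoch seconds, and that the logging host was 8 hours behind UTC when it formatted dates. Both are inferences from the excerpt rather than documented facts, and the helper names are made up for illustration. Under those assumptions, "Wed Dec 31 16:00:00 1969" is simply epoch 0, meaning no notification had ever been recorded for those services.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed entry prefix layout: [<epoch seconds>.<microseconds>:<level>.<verbosity>]
LOG_PREFIX = re.compile(r"^\[(\d+)\.(\d+):(\d+)\.(\d+)\]")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def entry_time(line):
    """Return the UTC time encoded in a debug-log entry prefix."""
    match = LOG_PREFIX.match(line)
    if match is None:
        raise ValueError("no timestamp prefix found")
    return EPOCH + timedelta(seconds=int(match.group(1)))


def epoch_zero_label(utc_offset_hours):
    """Format epoch 0 the way the log prints 'Last Notification'."""
    local = EPOCH.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime("%a %b %d %H:%M:%S %Y")


print(entry_time("[1181696031.131095:032.1] We shouldn't notify about this recovery."))
print(epoch_zero_label(-8))
```

This matches step 6 in the post: the recoveries are suppressed because no problem notification was ever sent for those services, so there is nothing to recover from as far as notifications are concerned.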
Re: [Nagios-users] timeouts when using secondary dns
Yey !! That totally did it. Thx AZ

I hadn't even considered messing with the resolver because I was sure it was a nagios issue, so I had to fix nagios. If that wasn't a textbook example of how well mailing lists can work then I don't know what is...

thx

On 11/7/06, Az <[EMAIL PROTECTED]> wrote:

stucky wrote:
> I use the check_by_ssh plugin for most of my stuff and I noticed that
> if the primary nameserver is unavailable nagios starts freaking out.
> All of a sudden all plugins time out. I tested it using the 'host'
> command and it only takes about 1 second longer to lookup hosts using
> the secondary nameserver.
> The default timeout for check_by_ssh is 10 seconds. I cranked it up to
> 30 and still I get timeouts. I'm not sure I understand that one.
> Has anyone else seen this.

We had a similar issue in that our primary DNS was doing strange things, and it quite often took 5 or even 10 seconds to perform a DNS lookup. What we were seeing was 70% of service checks (and subsequently host checks) failing by timing out.

The key was the multiple of 5 seconds. The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in resolv.h... which was 5 seconds.

We added the following to our resolv.conf, and found the problems went away:

options timeout:2 rotate

This sets the timeout for waiting for a reply to 2 seconds, and tells the resolver to rotate through your 'nameserver' entries rather than always hitting #1, then #2, etc.

Cheers.

--
stucky
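The "multiple of 5 seconds" observation can be illustrated with a deliberately simplified Python model of a stub resolver. This is an assumption-laden sketch, not the glibc algorithm: it ignores timeout scaling between attempts, and the number of lookups per check (two here) is a guess chosen only to show how the delays add up.

```python
def lookup_delay(timeout, servers_up, attempts=2):
    """Seconds spent waiting on dead nameservers before a live one answers.

    servers_up lists the nameservers in resolv.conf order (True = answering).
    Returns None if no server answers within the given number of attempts.
    """
    waited = 0
    for _ in range(attempts):
        for up in servers_up:
            if up:
                return waited
            waited += timeout
    return None


def rotated_delays(timeout, servers_up):
    """Delay for each possible starting server, as 'rotate' would cycle them."""
    count = len(servers_up)
    return [
        lookup_delay(timeout, servers_up[start:] + servers_up[:start])
        for start in range(count)
    ]


servers = [False, True]   # primary down, secondary up
lookups_per_check = 2     # hypothetical: e.g. one forward and one reverse lookup

print(lookups_per_check * lookup_delay(5, servers))  # default 5 s timeout
print(lookups_per_check * lookup_delay(2, servers))  # with 'options timeout:2'
print(rotated_delays(2, servers))                    # with 'rotate' added
```

With the default timeout, two lookups against a dead primary already consume the whole 10-second check_by_ssh budget in this model, while `timeout:2` brings that down to 4 seconds, and `rotate` means only some lookups pay the penalty at all.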
[Nagios-users] timeouts when using secondary dns
Guys

I use the check_by_ssh plugin for most of my stuff and I noticed that if the primary nameserver is unavailable nagios starts freaking out. All of a sudden all plugins time out. I tested it using the 'host' command and it only takes about 1 second longer to look up hosts using the secondary nameserver. The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and still I get timeouts. I'm not sure I understand that one.

Has anyone else seen this?

--
stucky
Re: [Nagios-users] check_dig bug ?
Sorry, forgot to mention - yes it's RHEL4. Hmm ok I will try the links. Thx

On 4/5/06, joshua sahala <[EMAIL PROTECTED]> wrote:

On (2006-04-05 10:57), stucky wrote:
> I was using check_dns before but it would constantly give me random timeout
> alerts although dns is fine. Considering that nslookup is being phased out
> anyways I decided to give check_dig a try.
> It works better but seems to have a bug related to the warning timeouts.
> Check this out :

is this on some RedHat variant by chance? there was a known RH kernel bug that was supposedly fixed, but my RHES4 still suffers from this same problem. I use check_dns.pl instead:

http://www.hannes-schulz.de/?doc=proj&proj=nagios
http://www.nagiosexchange.org/Networking.53.0.html?&tx_netnagext_pi1%5Bp_view%5D=27

hth

/joshua

--
A common mistake that people make when trying to design something
completely foolproof is to underestimate the ingenuity of complete fools.
- Douglas Adams

--
stucky
[Nagios-users] check_dig bug ?
guys

I was using check_dns before but it would constantly give me random timeout alerts although dns is fine. Considering that nslookup is being phased out anyways I decided to give check_dig a try. It works better but seems to have a bug related to the warning timeouts. Check this out :

[EMAIL PROTECTED] stucky]# /usr/local/nagios/libexec/check_dig -w 1 -c 2 -H {nameserver} -l {fqdn} -a {ip}
DNS OK - 0.008 seconds response time ({fqdn} 38400 IN A {ip})|time=0.008196s;1.00;2.00;0.00

[EMAIL PROTECTED] stucky]# /usr/local/nagios/libexec/check_dig -w 1 -c 2 -H {nameserver} -l {fqdn} -a {ip}
DNS WARNING - 0.011 seconds response time ({fqdn} 38400 IN A {ip})|time=0.011227s;1.00;2.00;0.00

Have I totally gone nuts or did I not just tell check_dig to only warn me if the query takes more than one second ? As you can see the tool itself reports it took only 0.011 seconds, so why the warning ? It's annoying because I get those random fake alerts and then a recovery message soon after. Thing is, even if I leave the -w and -c flags out entirely it'll still do that, as if it had a hardcoded value between 0.008 and 0.011 in there that can't be changed.

This is from nagios-plugins-1.4.2.

--
stucky
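The mismatch described above can be checked mechanically from the performance data the plugin prints after the `|`. The Python sketch below assumes the usual `label=value;warn;crit;min` layout and a plain "greater than threshold" rule. The helper names are hypothetical, and this is not how check_dig itself is implemented.

```python
def parse_perfdata(item):
    """Split one 'label=value[unit];warn;crit;min' item into usable numbers."""
    label, _, rest = item.partition("=")
    fields = rest.split(";")
    value = float(fields[0].rstrip("s"))  # drop the trailing seconds unit
    return label, value, float(fields[1]), float(fields[2])


def expected_state(value, warn, crit):
    """State implied by simple upper-bound thresholds."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


for output in (
    "time=0.008196s;1.00;2.00;0.00",
    "time=0.011227s;1.00;2.00;0.00",
):
    label, value, warn, crit = parse_perfdata(output)
    print(label, value, expected_state(value, warn, crit))
```

Both samples fall far below the 1-second warning threshold that the plugin itself echoes back, so the WARNING in the second run cannot come from the thresholds as printed.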
Re: [Nagios-users] Remote host check methods
I can only agree. I do all my checking with ssh and it works wonderfully. Every monitored machine has a 'nagios' user that is used to log on to it via an ssh key, and only the nagios box has the private key. In the public key entry I use the 'command' directive to force a sanity check on every command that is passed via ssh. This command compares what's in the SSH_ORIGINAL_COMMAND environment variable to a list of allowed commands. If it passes it gets executed, otherwise the check returns an error to nagios. It's a simple perl script I wrote.

So the authorized_keys file on all hosts for user 'nagios' looks like this:

from="{nagioshost}",command="/usr/local/nagios/home/acl_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding ssh-dss..key stuff

This way the key can only be used from the nagios box, and only for exactly the commands that the plugins need to run. acl_agent is the perl script and the ACLs themselves are maintained via cfengine.

On 3/30/06, Bill Jacqmein <[EMAIL PROTECTED]> wrote:

Im a bigger fan of check by ssh for unix like OSes.

On 3/29/06, Randall Perry <[EMAIL PROTECTED]> wrote:
> Got it running, but am having trouble with SSL, which I'll detail in a
> separate post.
>
> Randall Perry wrote:
> > I'm new to Nagios. Got it configured and running on my monitoring box.
> > Tried installing NRPE on a remote host running Mac OSX but couldn't get
> > it to run as daemon or through xinetd.
> >
> > There seem to be several methods to check remote hosts including SSH
> > plugins.
> >
> > Just wondering what other's method of choice is for this -- especially
> > on OSXS.
> >
> > TIA
>
> --
> Randall Perry
> sysTame
>
> Xserve Web Hosting/Co-location/Leasing
> QuickTime Streaming
> Mac Consulting/Sales
>
> http://www.systame.com/

--
stucky
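For readers who want to try the forced-command approach described above, here is a minimal Python sketch of such a wrapper. It is not the author's Perl script: the allowlist format (one exact command line per row), the file path, and the example plugin command are all hypothetical. It relies only on the `SSH_ORIGINAL_COMMAND` variable that sshd sets when a `command=` option is in force.

```python
import os
import shlex


def load_allowed(path):
    """Read one permitted command line per row, skipping blanks and '#' comments."""
    with open(path) as handle:
        rows = (row.strip() for row in handle)
        return {row for row in rows if row and not row.startswith("#")}


def check_command(original, allowed):
    """Return the argv to run if the command is on the allowlist, else None."""
    original = (original or "").strip()
    if original in allowed:
        return shlex.split(original)
    return None


def main(acl_path="/usr/local/nagios/home/acl_agent.allow"):  # hypothetical path
    argv = check_command(os.environ.get("SSH_ORIGINAL_COMMAND"), load_allowed(acl_path))
    if argv is None:
        print("UNKNOWN - command not permitted")
        return 3  # Nagios plugin exit code for UNKNOWN
    os.execv(argv[0], argv)  # replace this process; needs an absolute plugin path


# Demonstration with an in-memory allowlist (hypothetical plugin command).
demo_allowed = {"/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /"}
print(check_command("/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /", demo_allowed))
print(check_command("/bin/sh -c 'cat /etc/shadow'", demo_allowed))
```

Matching the whole command line exactly, rather than by prefix or pattern, keeps the check simple: anything that differs from an allowlisted line by even one argument is refused. When deployed as the forced command, the script would end with `sys.exit(main())` after importing `sys`.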
[Nagios-users] bizarre values in availability cgi
guys

I know a lot of questions have been asked about this but I have not found any answers yet, so I decided to drop another line. I have had nagios 2.0 running for a year now and it's been great. One thing that baffles me though is the availability reporting. We had 2 production outages the other day (one scheduled and one unscheduled). The scheduled one was about 37 minutes. The unscheduled one was about 7-10 minutes, a couple of hours later. This was after a few months of 100% uptime. When I created an availability report over the last 7 days I got these values.

Report period: last 7 days
Report time period: 24x7
First assumed service state: Service OK

State     Type         Time               % Total Time  % Known Time
OK        Unscheduled  6d 23h 28m 37s     99.689%       99.689%
          Total        6d 23h 28m 37s     99.689%       99.689%
CRITICAL  Unscheduled  49710d 6h 22m 1s   -0.062%       -0.062%
          Scheduled    0d 0h 37m 38s      0.373%        0.373%
          Total        0d 0h 31m 23s      0.311%        0.311%

Now the scheduled stuff seems about ok but the unscheduled stuff doesn't make sense. First of all, negative percentages ?? Second, how can a box be down for 49710 days 6 hours and 22 mins in a 7 day period ? The actual downtime was about 10 minutes but I don't see that in the report at all.

I did a test afterwards with another box. I created 2 outages:

1. Unscheduled for about 15 mins
2. Scheduled for about 8 mins

The report I did after that looks much better:

State     Type         Time               % Total Time  % Known Time
OK        Unscheduled  6d 22h 43m 14s     99.238%       99.238%
          Scheduled    0d 0h 55m 22s      0.549%        0.549%
          Total        6d 23h 38m 36s     99.788%       99.788%
CRITICAL  Unscheduled  0d 0h 14m 21s      0.142%        0.142%
          Scheduled    0d 0h 7m 3s        0.070%        0.070%
          Total        0d 0h 21m 24s      0.212%        0.212%

According to that the reporting seems fine, but over a long period of time the reports seem to get funny as shown above. I'm sorry if this has been discussed endlessly. If so could someone point me to those threads please ?

Thx

--
stucky
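The odd figures in the first report are at least internally consistent, which the following Python arithmetic sketch shows. It takes the CRITICAL total and scheduled times from the report, derives the unscheduled time as their difference, and then prints that negative duration as if it were stored in an unsigned 32-bit counter. Whether the CGI really uses such a counter is an assumption here; the sketch only demonstrates that the numbers line up, and it does not explain why the unscheduled time went negative in the first place.

```python
WEEK_SECONDS = 7 * 24 * 3600


def format_duration(seconds):
    """Render seconds in the report's 'Nd Nh Nm Ns' style."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


def as_unsigned32(seconds):
    """Reinterpret a possibly negative value as an unsigned 32-bit number."""
    return seconds % 2**32


critical_scheduled = 37 * 60 + 38   # 0d 0h 37m 38s from the report
critical_total = 31 * 60 + 23       # 0d 0h 31m 23s from the report
critical_unscheduled = critical_total - critical_scheduled

print(critical_unscheduled)
print(format_duration(as_unsigned32(critical_unscheduled)))
print(f"{100 * critical_unscheduled / WEEK_SECONDS:.3f}%")
```

The difference works out to -375 seconds, which wraps to exactly the reported 49710d 6h 22m 1s and gives the reported -0.062% of a 7-day period. So the "49710 days" line looks like a small negative duration being displayed as unsigned, rather than a genuinely huge downtime.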