Re: [Nagios-users] host-down notification can take 50 mins to be sent

2007-06-15 Thread stucky

I suggest you wait. The 3.x alpha is out now, but I've had worse problems with
it, so I reverted to the latest stable release.
I've also been running 2.0a at another site and have never had this problem
with that version.
So I'd be interested to see your results with 2.9 or 3.x.

Thanks for all your help.

On 6/15/07, Jim Avery <[EMAIL PROTECTED]> wrote:


On 15/06/07, stucky <[EMAIL PROTECTED]> wrote:
> Jim
>
> I'm confused
>
> 1. Nagios 2.9 comes with flapping turned off globally by default :
>
> # Values: 1 = enable flap detection
> # 0 = disable flap detection (default)
>
> enable_flap_detection=0
>
> 2. It also comes with check_for_orphaned_services=1
>
> 3. Most importantly it comes with a localhost.cfg file that has 2 nested
> host templates from the start.
>
> One called 'generic-host' and one called 'linux-server' which uses
> 'generic-host'
> Then it has a host definition that uses 'linux-server', so we have 3 levels
> of recursion right from the start.
> It does the same thing with service templates.
>
> I assume you must not have looked at the defaults at all or just changed it
> back to just one template.
> I never used more than one before either but since the default configs
> suggest it I figured it'd be ok.

I confess I don't use Nagios 2.9 yet.  Maybe it's time I should!

I'm hoping to buy a new server for Nagios soon, as the old one is
creaking under the strain of all the active checks and rrd databases.
When that arrives I'll drag myself back up to the cutting edge.  I
haven't looked at the defaults for 18 months or so.

cheers,

Jim

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when
reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null





--
stucky

[Nagios-users] Fwd: host-down notification can take 50 mins to be sent

2007-06-15 Thread stucky

-- Forwarded message --
From: stucky <[EMAIL PROTECTED]>
Date: Jun 15, 2007 11:17 AM
Subject: Re: [Nagios-users] host-down notification can take 50 mins to be
sent
To: Jim Avery <[EMAIL PROTECTED]>

Jim

I'm confused

1. Nagios 2.9 comes with flapping turned off globally by default :

# Values: 1 = enable flap detection
# 0 = disable flap detection (default)

enable_flap_detection=0

2. It also comes with check_for_orphaned_services=1

3. Most importantly it comes with a localhost.cfg file that has 2 nested
host templates from the start.

One called 'generic-host' and one called 'linux-server' which uses
'generic-host'
Then it has a host definition that uses 'linux-server', so we have 3 levels of
recursion right from the start.
It does the same thing with service templates.

I assume you must not have looked at the defaults at all or just changed it
back to just one template.
I never used more than one before either but since the default configs
suggest it I figured it'd be ok.

Does anyone else here use object inheritance with multi-level recursion?

On 6/15/07, Jim Avery <[EMAIL PROTECTED]> wrote:


On 15/06/07, stucky <[EMAIL PROTECTED]> wrote:
> Actually, flapping is turned off globally in nagios.cfg and although it's
> still turned on in the host template it shouldn't matter right ?

No, my understanding is that the global setting overrides everything
else (I've never turned it off globally myself though).

> I turned it off here as well.

Might as well.

> 1. How can a host be flapping if it's down ?

Flapping can be detected on any change of state down-up or up-down.
How it works is documented here:
http://nagios.sourceforge.net/docs/2_0/toc.html (look under advanced
topics).  Anyway, from what you're saying, it sounds like flapping
isn't your problem.

> 2. Are most of you guys on here using flap detection and has it been
> causing this kind of problem for anyone ?

Well I do, obviously.  I guess lots of others do as it is important in
preventing getting storms of notifications if a host or service is
flapping.  If you don't have the f notification option, it can
sometimes be confusing to see the notification that a service has gone
down, but not get one to show it's come up again.

One problem with having flap notifications is that Nagios might detect
that the host or service has stopped flapping at any odd hour - if
your notifications are going to an on-call pager for example you can
end up waking up your on-call engineer unnecessarily.  As ever it's up
to you to decide what your priorities are.

> 3. If it was flapping shouldn't I have seen this in the log?

Yes.  You'll see it in the alert history for that host (or service).

Another thing to try when you get this kind of behaviour is to set
check_for_orphaned_services to 1 in the main nagios config file.

I've never tried having a template use another template.  I'm not
saying it shouldn't work, it's just not something I would feel
comfortable doing.  I'd be interested to hear a definitive answer as
to whether that's a good thing to do or not myself as it would come in
handy sometimes.

cheers,

Jim






--
stucky


[Nagios-users] host-down notification can take 50 mins to be sent

2007-06-15 Thread stucky

Guys

I'm trying the latest stable 2.x version (2.9) and on top of the 2 already
existing default host templates I added a 3rd one since the documentation
states that there is no limit.

I added a host and started monitoring. When I took it down it took between 2
and 5 mins for the host down notification to come in.
However, later on I rebooted again and this time nothing came in. The nagios
log showed nothing about wanting to send a notification either. The box came
back without any notification.
I took it down again later and waited - after 50 minutes I got a host down
notification. When I brought the host back I almost immediately got a host
up notification.

I removed one of the templates to change the recursion level of the host
templates from 3 to 2 and tried again. I did 3 tests and all came back fine
this time. I always got the notification within 5 minutes max.
Then I added the 3rd template back again to see whether it had to do with
that but now I can't reproduce this. I did 2 tests and both were fine.

I don't feel that I can trust nagios now though. I've been using it for a
few years now since version 1.2 and I've never seen this behaviour before.
However, I've also never used more than 1 host/service template. This time I
wanted to make more use of the object inheritance logic to shorten my cfg
but somehow I feel it causes problems.
How deep is the template recursion for most of you folks ?

Here are the templates I was using when the 50 min delay happened

Hosts :

# Host templates

define host{
   name                            generic-host
   notifications_enabled           1
   event_handler_enabled           1
   flap_detection_enabled          1
   failure_prediction_enabled      1
   process_perf_data               1
   retain_status_information       1
   retain_nonstatus_information    1
   notification_period             24x7
   register                        0
   }

define host{
   name                            generic-linux
   use                             generic-host
   check_period                    24x7
   max_check_attempts              10
   check_command                   check-host-alive
   notification_interval           120
   notification_options            d,u,r
   register                        0
   }

define host{
   name                            prod
   use                             generic-linux
   contact_groups                  sysadmins,psst
   register                        0
   }

define host{
   name                            nonprod
   use                             generic-linux
   contact_groups                  sysadmins
   register                        0
   }

Then I use either the prod or nonprod template for all my hosts.
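
For readers who want to see what a host ends up with after the three-level chain above, here is a rough Python sketch of how the 'use' directive could be resolved. It is only an illustration, not Nagios's own code: the TEMPLATES dictionary copies a few directives from the host templates quoted above, the function name resolve_template is made up, and the sketch assumes single inheritance where a value set closer to the host overrides an inherited one. The 'name', 'use' and 'register' directives are left out of the result as a simplification.

```python
# A subset of the host templates quoted above, keyed by template name.
TEMPLATES = {
    "generic-host": {
        "notifications_enabled": "1",
        "flap_detection_enabled": "1",
        "notification_period": "24x7",
    },
    "generic-linux": {
        "use": "generic-host",
        "max_check_attempts": "10",
        "check_command": "check-host-alive",
        "notification_interval": "120",
        "notification_options": "d,u,r",
    },
    "prod": {
        "use": "generic-linux",
        "contact_groups": "sysadmins,psst",
    },
}


def resolve_template(name, templates):
    """Follow 'use' links and merge directives, nearest definition winning."""
    chain = []
    current = name
    while current is not None:
        if current in chain:
            raise ValueError("template loop at %r" % current)
        chain.append(current)
        current = templates[current].get("use")

    effective = {}
    # Apply the most generic template first so closer ones override it.
    for template_name in reversed(chain):
        for directive, value in templates[template_name].items():
            if directive != "use":
                effective[directive] = value
    return chain, effective


chain, effective = resolve_template("prod", TEMPLATES)
print(" -> ".join(chain))
for directive in sorted(effective):
    print("%-26s %s" % (directive, effective[directive]))
```

The point of the sketch is only that the depth of the chain should not change the merged result: 'prod' ends up with the same directives whether they are spread over three templates or written into one.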

same with services :

# Service templates

define service{
   name                            generic-service
   active_checks_enabled           1
   passive_checks_enabled          1
   parallelize_check               1
   obsess_over_service             1
   check_freshness                 0
   notifications_enabled           1
   event_handler_enabled           1
   flap_detection_enabled          1
   failure_prediction_enabled      1
   process_perf_data               1
   retain_status_information       1
   retain_nonstatus_information    1
   is_volatile                     0
   register                        0
   }

define service{
   name                            generic-checks
   use                             generic-service
   check_period                    24x7
   max_check_attempts              4
   normal_check_interval           5
   retry_check_interval            1
   notification_options            w,u,c,r
   notification_interval           60
   notification_period             24x7
   register                        0
   }


define service{
   name                            prod
   use                             generic-checks
   contact_groups                  sysadmins,psst
   register                        0
   }

define service{
   name                            nonprod
   use                             generic-checks
   contact_groups                  sysadmins
   register                        0
   }

Here I also use prod or nonprod as templates for my services.

I'm gonna test some more tomorrow but I'm worried that if a host goes down I
might not get notified until 50 mins later, or maybe never, who knows ?
It doesn't seem to behave the same way every time but as far as I see it the
service checks are every 5 minutes so within that time frame I should get a
notification.
Parallel checks are turned on as well.

[Nagios-users] host down notification but no host up notification ?

2007-06-12 Thread stucky
ion: Wed Dec 31 16:00:00 1969
[1181696031.131095:032.1] We shouldn't notify about this recovery.
[1181696031.131102:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696111.052735:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'CFENVD', Type: 0, Current State: 0, Last Notification:
Wed Dec 31 16:00:00 1969
[1181696111.052759:032.1] We shouldn't notify about this recovery.
[1181696111.052766:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696111.052971:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'PERC CONTROLLER', Type: 0, Current State: 0, Last
Notification: Wed Dec 31 16:00:00 1969
[1181696111.052984:032.1] We shouldn't notify about this recovery.
[1181696111.052992:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696111.053334:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'CFEXECD', Type: 0, Current State: 0, Last Notification:
Wed Dec 31 16:00:00 1969
[1181696111.053348:032.1] We shouldn't notify about this recovery.
[1181696111.053355:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696121.163710:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'MEM', Type: 0, Current State: 0, Last Notification: Wed
Dec 31 16:00:00 1969
[1181696121.163738:032.1] We shouldn't notify about this recovery.
[1181696121.163746:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696121.163984:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'DISK USAGE /var', Type: 0, Current State: 0, Last
Notification: Wed Dec 31 16:00:00 1969
[1181696121.163998:032.1] We shouldn't notify about this recovery.
[1181696121.164005:032.0] Notification viability test failed.  No
notification will be sent out.
[1181696141.130999:032.0] ** Service Notification Attempt ** Host:
'lithium', Service: 'DISK USAGE /', Type: 0, Current State: 0, Last
Notification: Wed Dec 31 16:00:00 1969
[1181696141.131023:032.1] We shouldn't notify about this recovery.
[1181696141.131031:032.0] Notification viability test failed.  No
notification will be sent out.
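
A side note on the "Last Notification: Wed Dec 31 16:00:00 1969" that appears in every entry above: that date is what a timestamp of 0 looks like when printed in a timezone 8 hours behind UTC, so it most likely means "no notification has ever been sent" rather than a real date. The short Python sketch below shows the conversion. The UTC-8 offset is an assumption, since the server's timezone is not stated in the thread, and the function name is made up.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Nagios box was printing local time at UTC-8.
ASSUMED_ZONE = timezone(timedelta(hours=-8))


def render_like_log(epoch_seconds, zone=ASSUMED_ZONE):
    """Format a Unix timestamp the way the dates in the debug log look."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=zone)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# A "last notification" time of 0, i.e. never notified.
print(render_like_log(0))
```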

Clearly, nagios decided that I shouldn't get a host up notification. I just
don't understand why. From the log files I'd say the following logic takes
place :

1. Host goes down - service check fails
2. Nagios checks to see if host is down - YES
3. Because of step 2. no service notifications are sent
4. Host down notification is sent instead
5. Host comes back
6. Service checks start recovering - no service recovery notification is
sent since no service problem notifications were sent in the first place.
7. Host is assumed to be up since service is up
8. Hence - no host up notification.

At first I thought my host up notification might not be making it through one
of the notification filters, but according to the log there is NO HOST check
after the reboot, and therefore there is no host notification attempt.
Looks to me like a design bug but I wanna make sure I'm not getting this
wrong. It just doesn't make sense to me that I wouldn't be notified
about a host coming back. I understand the part about the services.

INTERESTING: I have rebooted a few times and it appears that sometimes I do
get host up notifications but most of the time I don't, so it seems to have
to do with when exactly the reboot occurs.
Also, I turned off flapping globally but it made no difference.

Anyone seen this behaviour ?
--
stucky

Re: [Nagios-users] timeouts when using secondary dns

2006-11-09 Thread stucky
Yey !! That totally did it. Thx AZ. I hadn't even considered messing with the
resolver cuz I was sure it was a nagios issue so I had to fix nagios.
If that wasn't a text book example of how well mailinglists can work then I
don't know what is...

thx

On 11/7/06, Az <[EMAIL PROTECTED]> wrote:

stucky wrote:
> I use the check_by_ssh plugin for most of my stuff and I noticed that
> if the primary nameserver is unavailable nagios starts freaking out.
> All of a sudden all plugins time out. I tested it using the 'host'
> command and it only takes about 1 second longer to lookup hosts using
> the secondary nameserver.
> The default timeout for check_by_ssh is 10 seconds. I cranked it up to
> 30 and still I get timeouts. I'm not sure I understand that one.
> Has anyone else seen this.

We had a similar issue in that our primary DNS was doing strange things,
and it quite often took 5 or even 10 seconds to perform a DNS lookup.
What we were seeing was 70% of service checks (and subsequently host
checks) failing by timing out. The key was the multiple of 5 seconds.

The resolver timeout on, say, RHEL3 is based on RES_TIMEOUT in
resolv.h... which was 5 seconds.

We added the following to our resolv.conf, and found the problems went away:

options timeout:2 rotate

This sets the timeout for waiting for a reply to 2 seconds, and tells
the resolver to rotate through your 'nameserver' entries rather than
always hitting #1, then #2, etc.

Cheers.

--
stucky
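
To check which resolver options a monitoring box actually has in effect, something like the following Python sketch could be run against the contents of /etc/resolv.conf. It is only a sketch: the function name is made up, the nameserver addresses in the sample are placeholders from the documentation address range, and it only understands 'options' lines in the form quoted above (a bare flag such as rotate, or name:number such as timeout:2).

```python
def parse_resolver_options(resolv_conf_text):
    """Collect the settings found on 'options' lines of resolv.conf text."""
    options = {}
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] != "options":
            continue
        for option in fields[1:]:
            name, separator, value = option.partition(":")
            # "timeout:2" carries a number, "rotate" is a plain flag.
            options[name] = int(value) if separator else True
    return options


# Hypothetical file contents; only the options line comes from the thread.
SAMPLE = """\
nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:2 rotate
"""

print(parse_resolver_options(SAMPLE))
```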

[Nagios-users] timeouts when using secondary dns

2006-11-06 Thread stucky
Guys

I use the check_by_ssh plugin for most of my stuff and I noticed that if the
primary nameserver is unavailable nagios starts freaking out.
All of a sudden all plugins time out. I tested it using the 'host' command and
it only takes about 1 second longer to lookup hosts using the secondary
nameserver.
The default timeout for check_by_ssh is 10 seconds. I cranked it up to 30 and
still I get timeouts. I'm not sure I understand that one.
Has anyone else seen this?

--
stucky

Re: [Nagios-users] check_dig bug ?

2006-04-05 Thread stucky
sorry forgot to mention - yes it's RHEL4. Hmm ok I will try the links. Thx

On 4/5/06, joshua sahala <[EMAIL PROTECTED]> wrote:

On (2006-04-05 10:57), stucky wrote:
> I was using check_dns before but it would constantly give me random timeout
> alerts although dns is fine. Considering that nslookup is being phased out
> anyways I decided to give check_dig a try.
> It works better but seems to have a bug related to the warning timeouts.
> Check this out :

is this on some RedHat variant by chance?  there was a known RH kernel bug that
was supposedly fixed, but my RHES4 still suffers from this same problem.  I
use check_dns.pl instead:

http://www.hannes-schulz.de/?doc=proj&proj=nagios
http://www.nagiosexchange.org/Networking.53.0.html?&tx_netnagext_pi1%5Bp_view%5D=27

hth
/joshua

--
A common mistake that people make when trying to design something
completely foolproof is to underestimate the ingenuity of complete
fools.
- Douglas Adams

--
stucky


[Nagios-users] check_dig bug ?

2006-04-05 Thread stucky
guys

I was using check_dns before but it would constantly give me random
timeout alerts although dns is fine. Considering that nslookup is being
phased out anyways I decided to give check_dig a try.
It works better but seems to have a bug related to the warning timeouts. Check this out :

[EMAIL PROTECTED] stucky]# /usr/local/nagios/libexec/check_dig -w 1 -c 2 -H {nameserver} -l {fqdn} -a {ip}
DNS OK - 0.008 seconds response time ({fqdn} 38400 IN A {ip})|time=0.008196s;1.00;2.00;0.00

[EMAIL PROTECTED] stucky]# /usr/local/nagios/libexec/check_dig -w 1 -c 2 -H {nameserver} -l {fqdn} -a {ip}
DNS WARNING - 0.011 seconds response time ({fqdn} 38400 IN A {ip})|time=0.011227s;1.00;2.00;0.00

Have I totally gone nuts or did I not just tell check_dig to only warn
me if the query takes more than one second ? As you can see the tool
itself reports it took only 0.011 seconds so why the warning ?
It's annoying cause I get those random fake alerts and a recovery message soon after.
Thing is even if I totally leave the -w and -c flags out it'll still do
that as if it had a hardcoded value between 0.008 and 0.011 in there
that can't be changed.
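
The mismatch can be checked mechanically from the performance data that check_dig prints after the '|'. The Python sketch below parses one such item and works out which state the plugin's own reported thresholds would imply. It is a simplified illustration: the function names are made up, and it treats the warn and crit fields as plain "greater than" limits rather than handling the full plugin threshold range syntax.

```python
import re


def parse_perfdata(item):
    """Split one 'label=value[unit];warn;crit;min' item into its parts."""
    label, _, rest = item.partition("=")
    fields = rest.split(";")
    match = re.match(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$", fields[0])
    if match is None:
        raise ValueError("unrecognised value: %r" % fields[0])

    def number_at(index):
        # Threshold fields may be missing or empty.
        if len(fields) > index and fields[index]:
            return float(fields[index])
        return None

    return {
        "label": label,
        "value": float(match.group(1)),
        "unit": match.group(2),
        "warn": number_at(1),
        "crit": number_at(2),
    }


def expected_state(perf):
    """State implied by simple greater-than thresholds (an assumption)."""
    if perf["crit"] is not None and perf["value"] > perf["crit"]:
        return "CRITICAL"
    if perf["warn"] is not None and perf["value"] > perf["warn"]:
        return "WARNING"
    return "OK"


# Performance data copied from the two runs shown above.
for item in ("time=0.008196s;1.00;2.00;0.00", "time=0.011227s;1.00;2.00;0.00"):
    perf = parse_perfdata(item)
    print(perf["value"], perf["warn"], perf["crit"], expected_state(perf))
```

By the thresholds the plugin itself echoes back (1.00 and 2.00 seconds), both runs would be expected to come out OK, which is what makes the WARNING on the second run look like a plugin bug rather than a configuration mistake.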

This is from nagios-plugins-1.4.2.

--
stucky


Re: [Nagios-users] Remote host check methods

2006-03-30 Thread stucky
I can only agree. I do all my checking with ssh and it works
wonderfully. Every monitored machine has a 'nagios' user that the nagios
box logs in as with an ssh key; only the nagios box holds the private key.
In the public key entry I use the 'command' option to force a
sanity check on every command that is passed via ssh. This command
compares what's in the SSH_ORIGINAL_COMMAND environment variable to a
list of allowed commands.  If it passes it gets executed, otherwise
nagios gets an error. It's a simple perl script I wrote.
So the authorized_keys file on all hosts for user 'nagios' looks like this:

from="{nagioshost}",command="/usr/local/nagios/home/acl_agent",no-port-forwarding,no-X11-forwarding,no-agent-forwarding
ssh-dss..key stuff
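
The acl_agent script itself is not shown in this thread. As a rough idea of what such a forced-command check can look like, here is a minimal sketch in Python rather than perl. Everything in it is an assumption for illustration: the function name, the allow list and the two plugin command lines are hypothetical, and the original script may work quite differently.

```python
import shlex

# Hypothetical allow list: the exact command lines the plugins may send.
ALLOWED_COMMANDS = frozenset({
    "/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /",
    "/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6",
})


def vet_command(environ, allowed=ALLOWED_COMMANDS):
    """Return the argv to run, or None if the request is not allowed.

    sshd puts the command the client asked for into SSH_ORIGINAL_COMMAND
    when a forced command is configured for the key.
    """
    requested = environ.get("SSH_ORIGINAL_COMMAND", "")
    if requested not in allowed:
        return None
    # Split into an argv list so the command never goes through a shell.
    return shlex.split(requested)


# A permitted request and a rejected one.
print(vet_command({
    "SSH_ORIGINAL_COMMAND":
        "/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6",
}))
print(vet_command({"SSH_ORIGINAL_COMMAND": "cat /etc/shadow"}))
```

A real wrapper would pass os.environ to the function, run the returned argv (for example with subprocess.run) and exit with the plugin's return code, and print a short message and exit with 3 (UNKNOWN in the plugin convention) when the result is None.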

This way the key can only be used from the nagios box for exactly the
commands that the plugins need to run. acl_agent is the perl script and
the acl's themselves are maintained via cfengine.

On 3/30/06, Bill Jacqmein <[EMAIL PROTECTED]> wrote:

I'm a bigger fan of check by ssh for unix like OSes.

On 3/29/06, Randall Perry <[EMAIL PROTECTED]> wrote:
> Got it running, but am having trouble with SSL, which I'll detail in a
> separate post.
>
> Randall Perry wrote:
> > I'm new to Nagios. Got it configured and running on my monitoring box.
> > Tried installing NRPE on a remote host running Mac OSX but couldn't get
> > it to run as daemon or through xinetd.
> >
> > There seem to be several methods to check remote hosts including SSH
> > plugins.
> >
> > Just wondering what other's method of choice is for this -- especially
> > on OSXS.
> >
> > TIA
>
> --
> Randall Perry
> sysTame
>
> Xserve Web Hosting/Co-location/Leasing
> QuickTime Streaming
> Mac Consulting/Sales
>
> http://www.systame.com/

--
stucky


[Nagios-users] bizarre values in availability cgi

2006-03-24 Thread stucky
guys

I know a lot of questions have been asked about this but I have not found any answers yet so I decided to drop another line.

I have had nagios 2.0 running for a year now and it's been great. One
thing that baffles me though is the availability reporting.

We had 2 production outages the other day (one scheduled and one unscheduled). The scheduled one lasted about 37 minutes.
The unscheduled one lasted about 7-10 minutes, a couple of hours later. This was after a few months of 100% uptime.
When I created an availability report over the last 7 days I got these values.

Report period : last 7 days
Report time Period: 24x7
First Assumed Service State: Service OK

OK
  Unscheduled   6d 23h 28m 37s      99.689%   99.689%
  Total         6d 23h 28m 37s      99.689%   99.689%

CRITICAL
  Unscheduled   49710d 6h 22m 1s    -0.062%   -0.062%
  Scheduled     0d 0h 37m 38s        0.373%    0.373%
  Total         0d 0h 31m 23s        0.311%    0.311%
Now the scheduled stuff seems about ok but the unscheduled stuff doesn't make sense.
First of all negative percentages ??
Second, how can a box be down for 49710 days 6 hours and 22 mins in a 7 day period ? The actual downtime was about 10 minutes
but I don't see that in the report at all.
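
One observation on the odd numbers, offered as arithmetic rather than a diagnosis: 49710d 6h 22m 1s is exactly 375 seconds short of 2^32 seconds, and -375 seconds out of a 7 day week is -0.062%. So the figures are consistent with an unscheduled CRITICAL time of -375 seconds being printed as an unsigned 32-bit number. The Python sketch below just redoes that arithmetic from the values in the report; it does not explain why the CGI arrived at a negative duration in the first place.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60   # the "last 7 days" report period
WRAP = 2 ** 32                    # size of an unsigned 32-bit counter


def to_seconds(days, hours, minutes, seconds):
    """Convert a 'Nd Nh Nm Ns' duration from the report into seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


reported_unscheduled = to_seconds(49710, 6, 22, 1)
scheduled = to_seconds(0, 0, 37, 38)
reported_total = to_seconds(0, 0, 31, 23)

# Reinterpret the huge value as a wrapped negative number.
signed_unscheduled = reported_unscheduled - WRAP

print(signed_unscheduled)
print(round(100.0 * signed_unscheduled / WEEK_SECONDS, 3))
print(scheduled + signed_unscheduled == reported_total)
```

The last line checks that the CRITICAL total in the report (31m 23s) is the scheduled time plus that negative unscheduled time, which it is.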
I did a test after with another box. I created 2 outages.
1. Unscheduled for about 15 mins 
2. scheduled for about 8 mins

The report I did after that looks much better:

OK
  Unscheduled   6d 22h 43m 14s      99.238%   99.238%
  Scheduled     0d 0h 55m 22s        0.549%    0.549%
  Total         6d 23h 38m 36s      99.788%   99.788%

CRITICAL
  Unscheduled   0d 0h 14m 21s        0.142%    0.142%
  Scheduled     0d 0h 7m 3s          0.070%    0.070%
  Total         0d 0h 21m 24s        0.212%    0.212%
According to that the reporting seems fine but over a long period of time the reports seem to get funny as shown above.
I'm sorry if this has been discussed endlessly. If so could someone point me to those threads please ?

Thx
-- stucky