Re: [Nagios-users] Host down, still doing active checks, causing multiple unwanted service failures

Toussaint OTTAVI Tue, 09 Dec 2008 04:14:43 -0800

Hi Marc, thank you for your answer,

Marc Powell wrote:

Nagios is first and foremost a service monitor, not a host monitor. Host monitoring is only necessary, as far as nagios is concerned, for two reasons:
        - notification suppression. If the host is down, don't notify about the services. They're still down so show them down, but don't wake anybody up over it if they're not also responsible for the host.
        - parenting/unreachable logic.

I agree with you. Parenting / unreachable logic is a very good thing. But I think it should be possible to declare a service as a child of its host. This parent/child logic can suppress 'notifications'. I think it could also suppress the display of inaccurate 'status' on the console window.

We do not use email notifications, because we are only 2 guys, and this would generate too many messages. We periodically check the web console, and on all our PCs we use small plugins for Firefox and Windows that display the list of errors/warnings in a small popup. When a host is down, we just get pages of errors about all the failed services, when we would like to have just one. It would be interesting for us if the parent/child notification suppression mechanism could also suppress these unwanted displays.

Nagios is designed to show the current state of services as accurately as possible. This helps explain the 'why' of the behavior you are seeing and works very well to cover the edge cases that your goal won't catch. For example, if your host check is a ping and something borks ICMP on your network, you would have all the services on that host disabled and set to unknown, even though they are working just fine.

That's not what happens. Most of the monitored hosts are located on WANs. These links, at least those from my office, are used only for remote control and remote administration, thus they're built with cheap technologies, not intended to be highly reliable. When a host stops answering pings, it usually means the WAN link is down. The action is usually to reboot a router, or reset a VPN tunnel. But, during this time, there's no sense in sending hundreds of checks through this WAN, because they will fail. And there's no need for me to be told the services are in a failed status. They may be working fine. But the service checks won't have any chance of success, because of the WAN failure. So, what I would expect in the service status is "UNKNOWN". Same as when a child becomes "UNREACHABLE" because its parent is down.

Your understanding of exactly what is impacted on that host is now completely wrong. By artificially changing the service state, your reporting is no longer reliable as well. You may be fine with that but understand that your goal is opposite of what nagios is meant to do.

In my configuration, WAN failures occur far more often than a general crash of a host taking lots of services down. I agree with you: when the WAN is down, my understanding of exactly what is impacted on the host is completely wrong. Nagios says all the services are down, when it should say, in my opinion, that it could not determine the status of the services.

Moreover, plugins from various sources behave differently when the host is unreachable. Some plugins return UNKNOWN, which may be the most accurate result in such a situation. But some plugins return FAILED, and some others return WARNING. This adds a little more confusion to the console, where it may not be easy to find the original problem.
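To illustrate the kind of uniform behaviour I mean, here is a minimal sketch of a wrapper around a plugin. It is only an illustration, not something Nagios provides: the function names are hypothetical, and it assumes a separate reachability probe command (for example a ping plugin) that exits with 0 when the host answers. The exit codes 0 to 3 are the standard plugin return codes (OK, WARNING, CRITICAL, UNKNOWN).

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def normalize_state(plugin_exit_code, host_reachable):
    """Return UNKNOWN when the host is unreachable, else the plugin's own state."""
    if not host_reachable:
        return UNKNOWN
    # Exit codes outside 0..3 are not valid plugin states; report UNKNOWN.
    if plugin_exit_code not in (OK, WARNING, CRITICAL, UNKNOWN):
        return UNKNOWN
    return plugin_exit_code


def run_wrapped(reachability_cmd, plugin_cmd):
    """Run the probe first, then the real plugin; return (state, output).

    Both arguments are argument lists; which probe and which plugin to run
    is an assumption left to the caller.
    """
    probe = subprocess.run(reachability_cmd, capture_output=True, text=True)
    if probe.returncode != 0:
        # The real plugin is never run, so nothing more is sent over the WAN.
        return UNKNOWN, "UNKNOWN - host unreachable, service state not determined"
    result = subprocess.run(plugin_cmd, capture_output=True, text=True)
    return normalize_state(result.returncode, True), result.stdout.strip()
```

The drawback is obvious: the probe runs before every service check, so this trades one kind of extra traffic for another.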

Instead of disabling the service checks, you may be able to use adaptive monitoring to change the service check_commands to something that always returns UNKNOWN (i.e. check_dummy).

I already thought about that. But I would have to change every check_command for every service. And, more complicated, I would have to put back the contents of all the original service checks when the host comes back. About disabling the services, there's an external command called "DISABLE ALL SERVICE CHECKS" for a particular host, so that I can disable all services in one go. But to change service check_commands, I would have to do that for every service, which would be a huge amount of work and quite difficult to maintain! Each remote server has approximately 20 service checks, some hundred services total, and this is only the beginning: the full setup would require some thousands of checks, all of them located over poor WAN links...
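For reference, a rough sketch of the "one go" approach. I believe the external commands in question are DISABLE_HOST_SVC_CHECKS and ENABLE_HOST_SVC_CHECKS, which take the host name as their only argument and are written to the Nagios command file as a "[timestamp] COMMAND;args" line. The helper names, the host name "remote-server-1" and the command file path are assumptions for illustration; the real path is whatever command_file is set to in nagios.cfg.

```python
import time


def format_external_command(name, *args, now=None):
    """Build one external command line: "[timestamp] NAME;arg1;arg2"."""
    timestamp = int(time.time() if now is None else now)
    fields = [name] + [str(arg) for arg in args]
    return "[{}] {}\n".format(timestamp, ";".join(fields))


def host_service_checks_command(host_name, enable, now=None):
    """Enable or disable active checks of all services on one host."""
    name = "ENABLE_HOST_SVC_CHECKS" if enable else "DISABLE_HOST_SVC_CHECKS"
    return format_external_command(name, host_name, now=now)


def send_command(line, command_file):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Example (hypothetical host name and path), to be called from whatever
# reacts to the host going down and coming back:
# send_command(host_service_checks_command("remote-server-1", enable=False),
#              "/usr/local/nagios/var/rw/nagios.cmd")
```

Compared with changing each check_command, this needs one command per host when it goes down and one when it comes back, and nothing has to be remembered in between. Note that the services keep their last known state; they are not turned to UNKNOWN.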

In fact, the parent/child mechanism seems to be the right way to handle hosts located behind WANs or routers. In my opinion, it should be possible to consider services as children of their parent host. This may be a feature request for future versions...


Following this idea, I will investigate the following:

- Hosts are associated with each other through parent/child relationships according to the WAN topology (already working)

- For each host, I will create a "parent" service with only a check_alive command

- Every other service will be a child of this parent service
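The closest existing mechanism I know of for the last two points is the servicedependency object. Since writing one definition per service by hand would be tedious with thousands of checks, here is a minimal sketch of a generator. The helper names, the host name "remote-server-1" and the service names "ALIVE", "HTTP" and "SSH" are assumptions for illustration; the directives themselves are the standard servicedependency ones. With execution_failure_criteria set to w,u,c, the dependent services should not be actively checked while the parent service is in WARNING, UNKNOWN or CRITICAL.

```python
def service_dependency(host, parent_service, child_service, criteria="w,u,c"):
    """Return one servicedependency definition as Nagios object config text."""
    directives = [
        ("host_name", host),
        ("service_description", parent_service),
        ("dependent_host_name", host),
        ("dependent_service_description", child_service),
        # Do not run the dependent check while the parent is in these states.
        ("execution_failure_criteria", criteria),
        # Do not notify for the dependent service in those states either.
        ("notification_failure_criteria", criteria),
    ]
    body = ["    {:<32}{}".format(key, value) for key, value in directives]
    return "\n".join(["define servicedependency {"] + body + ["}"])


def dependencies_for_host(host, parent_service, services):
    """Make every service on the host, except the parent, depend on the parent."""
    return "\n\n".join(
        service_dependency(host, parent_service, service)
        for service in services
        if service != parent_service
    )


# Example: print(dependencies_for_host("remote-server-1", "ALIVE",
#                                      ["ALIVE", "HTTP", "SSH"]))
```

As far as I understand, this stops the checks and the notifications, but the dependent services keep their last known state on the console rather than switching to UNKNOWN.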

I'll try right now. Comments and suggestions are welcome. Am I the only one having this problem?


Kind regards
--

*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* [EMAIL PROTECTED]
