Hi Marc, thank you for your answer.

Marc Powell wrote:
Nagios is first and foremost a service monitor, not a host monitor. Host monitoring is only necessary, as far as nagios is concerned, for two reasons:
        - notification suppression. If the host is down, don't notify about the services. They're still down so show them down, but don't wake anybody up over it if they're not also responsible for the host.
        - parenting/unreachable logic.

I agree with you. Parenting/unreachable logic is a very good thing. But I think it should be possible to declare a service as a child of its host. This parent/child logic could suppress notifications, and I think it could also suppress the display of inaccurate statuses in the web console.

We do not use email notifications, because there are only two of us and this would generate too many messages. We periodically check the web console, and on all our PCs we use small plugins for Firefox and Windows that display the list of errors/warnings in a small popup. When a host is down, we get pages of errors for all of its services, when we would like to see just one. It would be useful for us if the parent/child notification suppression mechanism could also suppress these unwanted entries.

Nagios is designed to show the current state of services as accurately as possible. This helps explain the 'why' of the behavior you are seeing and works very well to cover the edge cases that your goal won't catch. For example, if your host check is a ping and something borks ICMP on your network, you would have all the services on that host disabled and set to unknown, even though they are working just fine.

That's not what happens here. Most of the monitored hosts are located across WAN links. These links, at least those from my office, are used only for remote control and remote administration, so they're built with cheap technologies that are not intended to be highly reliable. When a host stops answering pings, it usually means the WAN link is down. The fix is usually to reboot a router or reset a VPN tunnel. In the meantime, there's no sense in sending hundreds of checks through this WAN, because they will fail. And there's no point in showing me the services in a failed status. They may be working fine, but the service checks don't have any chance of success because of the WAN failure. So what I would expect as the service status is "UNKNOWN", the same way a child host becomes "UNREACHABLE" when its parent is down.

Your understanding of exactly what is impacted on that host is now completely wrong. By artificially changing the service state, your reporting is no longer reliable as well. You may be fine with that but understand that your goal is opposite of what nagios is meant to do.

In my configuration, WAN failures occur far more often than a general crash of a host that takes lots of services down. I agree with you: when the WAN is down, my understanding of exactly what is impacted on the host is completely wrong. Nagios says all the services are down, when it should say, in my opinion, that it could not determine the status of the services.

Moreover, plugins from various sources behave differently when the host is unreachable. Some plugins return UNKNOWN, which may be the most accurate result in such a situation. But some plugins return CRITICAL, and some return WARNING. This adds a bit more confusion to the console, where it may not be easy to find the original problem.
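
To make the point concrete: the standard plugin exit codes are 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). Here is a minimal Python sketch of the behavior I would like to see. The function `effective_state` and its `host_reachable` flag are hypothetical names of mine, not something Nagios provides.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def effective_state(plugin_exit_code, host_reachable):
    """Return the state to display for a service (hypothetical helper).

    If the host cannot be reached, whatever the plugin returned is not
    trustworthy, so report UNKNOWN. Otherwise keep the plugin's result;
    exit codes outside 0-3 are also treated as UNKNOWN.
    """
    if not host_reachable:
        return UNKNOWN
    if plugin_exit_code not in STATE_NAMES:
        return UNKNOWN
    return plugin_exit_code
```

With this rule, a plugin that reports CRITICAL or WARNING only because the WAN is down would show up as UNKNOWN, whichever plugin produced it.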


Instead of disabling the service checks, you may be able to use adaptive monitoring to change the service check_commands to something that always returns UNKNOWN (i.e. check_dummy).

I already thought about that. But I would have to change the check_command of every service and, more complicated, restore all the original check_commands when the host comes back. For disabling the services, there's an external command (DISABLE_HOST_SVC_CHECKS) that disables all service checks for a particular host, so I can disable all services in one go. But to change service check_commands, I would have to do it service by service, which would be huge and quite difficult to maintain! Each remote server has approximately 20 service checks, some hundreds of services in total, and this is only the beginning: the full setup would require some thousands of checks, all of them located over poor WAN links...
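
For the "one go" variant, here is a rough Python sketch of what I mean. It builds the external command lines DISABLE_HOST_SVC_CHECKS and ENABLE_HOST_SVC_CHECKS in the usual "[timestamp] COMMAND;host" format. The command file path is an assumption (the default of a source install; the real one is the command_file setting in nagios.cfg), and the helper names are mine.

```python
import time

# Assumption: default command file location of a source install; adjust to
# the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def host_svc_checks_command(host_name, enable, now=None):
    """Build the external command line that toggles all service checks of a host."""
    timestamp = int(time.time() if now is None else now)
    verb = "ENABLE_HOST_SVC_CHECKS" if enable else "DISABLE_HOST_SVC_CHECKS"
    return "[{}] {};{}\n".format(timestamp, verb, host_name)


def submit(command_line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

Something like submit(host_svc_checks_command("remote-server-1", enable=False)) could be called from a host event handler when the host goes down, and the enable=True variant when it comes back. The host name is only an example.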

In fact, the parent/child mechanism seems to be the right way to handle hosts located behind WAN links or routers. In my opinion, it should be possible to treat services as children of their host. This may be a feature request for future versions...

Following this idea, I will investigate the following:
- Hosts are linked by parent/child relationships according to the WAN topology (already working)
- For each host, I will create a "parent" service with only a check_alive command
- Every other service will be a child of this parent service
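
Since there will be many services per host, I would probably generate the dependency definitions with a script. Below is a minimal Python sketch that prints servicedependency objects; the host name "remote-server-1" and the service names "ALIVE" and "HTTP" are made-up examples, and the failure criteria (w,u,c) are only my first guess.

```python
def service_dependency(host_name, parent_service, child_service):
    """Render one Nagios servicedependency object for a single host."""
    lines = [
        "define servicedependency {",
        "    host_name                      {}".format(host_name),
        "    service_description            {}".format(parent_service),
        "    dependent_host_name            {}".format(host_name),
        "    dependent_service_description  {}".format(child_service),
        # Skip checks and notifications of the child while the parent
        # service is WARNING, UNKNOWN or CRITICAL.
        "    execution_failure_criteria     w,u,c",
        "    notification_failure_criteria  w,u,c",
        "}",
    ]
    return "\n".join(lines) + "\n"


def service_dependencies(host_name, parent_service, child_services):
    """Render one dependency per child service, all on the same parent."""
    return "\n".join(
        service_dependency(host_name, parent_service, child)
        for child in child_services
    )


if __name__ == "__main__":
    print(service_dependencies("remote-server-1", "ALIVE", ["HTTP", "SMTP"]))
```

As far as I understand, this should stop the checks and notifications of the child services while the "ALIVE" service fails, but it will probably not change the status shown on the console.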

I'll try this right now. Comments and suggestions are welcome. Am I the only one with this problem?

Kind regards
--

*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* [EMAIL PROTECTED]
