[Nagios-users] Host down, still doing active checks, causing multiple unwanted service failures

Toussaint OTTAVI Mon, 08 Dec 2008 09:45:54 -0800

Hi list,

I've been investigating this problem for a while, but I couldn't find a good solution.


* Example situation :
Assume I have one host with 20 service checks.

* Problem :

If the host becomes DOWN, Nagios continues to run service checks on this host. So, after a while, all the services will go to a CRITICAL state. Then, in my console, I will see :

 - 1 Host down,
 - 20 Services down

This information is not pertinent. The only information I want to see in such a case is the "host down". The 20 "service down" entries are obvious consequences, and they generate "visual pollution" that makes it harder to identify the real problem.


* Expected behavior :
When a host is down, I would like to :
- See only one thing in red in the console : 1 HOST DOWN

- Disable all the service checks (which at this point do not have any chance of success)

- Put the services into "UNKNOWN" status

Comments:

In Nagios, there are parent/child dependencies. When a host is down, its child hosts are not tested, and their status becomes "UNREACHABLE". Good thing. Same thing for services. But, as far as I know, there are no dependencies between a host and its services. I googled and read a lot of things in the docs. This seems to be "by design": there's no way to declare a service as a child of its (parent) host! I didn't really understand the reasons for this choice, but I would like to work around it.

Then I played around with event handlers. When a host status changes, the event handler calls a script. The script checks the status of the "calling" host. If the host is DOWN or UNREACHABLE, it sends back to Nagios an "external command" to disable all active service checks. If the status of the host is UP, then it sends the external command to enable all service checks for that particular host. It works. But there is some "latency" between the time the services are disabled by the event handler and the time Nagios stops doing the service checks. Usually, some services are still checked, and they report an unwanted "FAILED" status. I think this is because these checks were queued before the handler disabled them, so they are executed anyway. So I'm not 100% satisfied.
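The event handler described above can be sketched roughly as follows. This is only a minimal illustration, not the actual script: it assumes the handler is called with the host state and host name as arguments (for example via the `$HOSTSTATE$` and `$HOSTNAME$` macros), and the command file path is an assumption that depends on the local `command_file` setting. The function names `build_command` and `submit` are hypothetical.

```python
import sys
import time

# Assumed path: use whatever "command_file" points to in the local nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host_state, host_name, now=None):
    """Return the external command line for this host state, or None."""
    if now is None:
        now = int(time.time())
    if host_state in ("DOWN", "UNREACHABLE"):
        command = "DISABLE_HOST_SVC_CHECKS"
    elif host_state == "UP":
        command = "ENABLE_HOST_SVC_CHECKS"
    else:
        # Unexpected state: do nothing rather than guess.
        return None
    # External command format: [timestamp] COMMAND;arguments
    return "[{0}] {1};{2}\n".format(now, command, host_name)


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command file."""
    with open(command_file, "w") as fifo:
        fifo.write(line)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: handler.py <host state> <host name>
    line = build_command(sys.argv[1], sys.argv[2])
    if line is not None:
        submit(line)
```

Note that this only changes whether future checks are scheduled. It does nothing about checks that were already queued when the command is processed, which matches the latency described above.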

The next step would be to use service event handlers to put every service into "UNKNOWN" status each time a service check is disabled. But I have two problems :
- In my external script, I cannot determine whether a service check is ENABLED or DISABLED. There are a lot of "macros" available, but none of them gives me this information.
- This may not solve the "latency" problem: if I manually set an "UNKNOWN" status on a DISABLED service while an active check is already in the queue, its result will still arrive later...
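For reference, one way to set a service to "UNKNOWN" by hand is to submit a passive check result with return code 3 through the external command file. The sketch below only builds the command line; the function name `build_passive_result` and the example host and service names are hypothetical, and it assumes passive checks are accepted for the service in question.

```python
import time

UNKNOWN = 3  # plugin return code meaning UNKNOWN


def build_passive_result(host_name, service_description, output, now=None):
    """Build a PROCESS_SERVICE_CHECK_RESULT line that reports UNKNOWN."""
    if now is None:
        now = int(time.time())
    # Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    return "[{0}] PROCESS_SERVICE_CHECK_RESULT;{1};{2};{3};{4}\n".format(
        now, host_name, service_description, UNKNOWN, output
    )


if __name__ == "__main__":
    # Hypothetical example values, printed instead of written to the command file.
    print(build_passive_result("web01", "HTTP", "Host is down"), end="")
```

As said above, this would not remove the latency problem: an active check that is already queued can still deliver its result after the passive one.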

Of course, the ideal situation would be to have a parent/child dependency acting between hosts and services...

Any comments and suggestions are welcome. Thank you in advance for your help.


Kind regards
--

*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* [EMAIL PROTECTED]

