Thanks a million for pointing out 'SCHEDULE_FORCED_SVC_CHECK'. I'm
now rewriting and testing the event handlers to take care of this. If
only there were a macro/variable for the master service... I'm looking
for a lightweight way to determine the service_description of the direct
parent of the check that just failed, so I can pass it to the command.
WRT the SSH/SNMP dependency issue, I have a feeling that I'm missing
something here altogether, or didn't include enough info in my initial
report, as both you and Hugo mentioned a possible issue with this.

To be clear, I'm doing this only so that if a dependent service
(Ganglia) IS down and SNMP has been shown to be up (after
'SCHEDULE_FORCED_SVC_CHECK'), I can make sure that SSH is running
before attempting to connect. Enough failure modes kill SSH at the same
time as other services that I want to avoid running a bunch of
high-latency/timeout/CPU-heavy event handlers that are bound to fail.
Thanks for the accurate pointer to that macro,
Cheers,
/eli
Here's the output of view config showing that it is configured the way I
think... just not sure if that is something I don't want to do :)
Dependent Host  Dependent Service                       Master Host    Master Service  Dependency Type  Dependency Failure Options
deathstar1001   SNMP-- Ganglia running                  deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- Ganglia running                  deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- NTP running                      deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- NTP running                      deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- cron running                     deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- cron running                     deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- automounter running 4 instances  deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- automounter running 4 instances  deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- load -lt 4                       deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- load -lt 4                       deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP                                    deathstar1001  SSH             Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP                                    deathstar1001  SSH             Check Execution  Warning, Unknown, Critical, Pending
John P. Rouillard wrote:
Hi Eli:
You didn't say what version of nagios you are running so I'll assume
2.0.
In message [EMAIL PROTECTED],
Eli Stair writes:
The question comes down to this:
Should a failed service check for a dependent trigger a check of its
parent before continuing?
IIRC from the code it does not force a check of the parent service. I
can see arguments for and against forcing a poll of the parent. Also
the documentation:
http://nagios.sourceforge.net/docs/2_0/dependencies.html
in the "How Service Dependencies Are Tested" section, says:
"Nagios gets the current status of the service that is being depended
upon", not "Nagios repolls the service being depended upon". A footnote
says:
"by default, Nagios will use the most current hard state of the
service(s) that is/are being depended upon"
An option in the config file will allow it to use the current soft
state instead. I use the soft state of the service being depended upon
myself.
If this is not the case, or default, is there _ANY_ way to implement this?
Sort of. The event handler for the child can send a
SCHEDULE_FORCED_SVC_CHECK external command for the parent specifying
the current time in seconds. See
http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=129
for details. The command will be acted upon immediately since nagios
reads the external command file after an event handler runs. Use this
to force an update of the current service status for the parent. Parse
through the objects.cache (probably in /var/log/nagios/objects.cache)
file for the expanded servicedependency objects to find the service
dependencies that match your host/service.
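A rough sketch of that lookup in Python. It assumes objects.cache stores each dependency as a "define servicedependency { ... }" block, with host_name/service_description naming the parent and dependent_host_name/dependent_service_description naming the child; check that against your own cache file. The function names are hypothetical helpers, not part of Nagios. The command line follows the "[time] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time" layout from the external command documentation.

```python
import re
import time

# Matches one "define servicedependency { ... }" block (assumed cache layout).
BLOCK_RE = re.compile(r"define\s+servicedependency\s*\{(.*?)\}", re.DOTALL)


def parse_service_dependencies(cache_text):
    """Return a list of dicts, one per servicedependency block."""
    deps = []
    for body in BLOCK_RE.findall(cache_text):
        fields = {}
        for line in body.splitlines():
            # Each line is "directive<whitespace>value"; values may contain spaces.
            parts = line.strip().split(None, 1)
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        deps.append(fields)
    return deps


def parents_of(deps, host, service):
    """Return the distinct (host, service) pairs that host/service depends on."""
    parents = []
    for d in deps:
        if (d.get("dependent_host_name") == host
                and d.get("dependent_service_description") == service):
            pair = (d.get("host_name"), d.get("service_description"))
            if pair not in parents:
                parents.append(pair)
    return parents


def forced_check_line(host, service, now=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command scheduled for 'now'."""
    now = int(time.time()) if now is None else int(now)
    return "[%d] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%d\n" % (now, host, service, now)
```

The event handler would call parents_of() with $HOSTNAME$ and $SERVICEDESC$, then append each forced_check_line() result to the external command file named by command_file in nagios.cfg.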
I set my nagios options so that:
max_check_attempts(dependent) * retry_check_interval(dependent)
    >= normal_check_interval(parent)
This way the parent service will be checked at least once during the
soft error interval of the dependent service.
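As a quick sanity check, here is a small sketch of that rule. It reads the rule as: the dependent's max_check_attempts times its retry_check_interval should be at least the parent's normal_check_interval. The function name is hypothetical and all intervals are assumed to be in the same time units.

```python
def parent_checked_during_soft_window(max_check_attempts,
                                      retry_check_interval,
                                      parent_normal_check_interval):
    """True if the dependent's soft-error window covers one parent check interval."""
    soft_window = max_check_attempts * retry_check_interval
    return soft_window >= parent_normal_check_interval


# Dependent retried 4 times, 2 units apart; parent checked every 5 units.
print(parent_checked_during_soft_window(4, 2, 5))
# Dependent retried 3 times, 1 unit apart; parent checked every 5 units.
print(parent_checked_during_soft_window(3, 1, 5))
```

The first configuration satisfies the rule and the second does not, so in the second case the dependent could reach a hard state before the parent has been rechecked.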
I want to avoid at all costs having an every-minute check of the parent
processes on many thousands of hosts just to keep the child
process checks and event handlers from going haywire.
You need to use the max_check_attempts to provide a buffer in which
the parent service will be checked. You can have your event handler
submit an external command on the first soft error and try to fix the
problem on a subsequent soft or hard error. You don't have any of