Hi all,

yesterday I tried to implement passive checks for a volatile service with a 1-second interval (i.e., once a second, the status of the service is written to the Nagios command file), but I am having problems with how the service status is displayed and with notifications. Since I haven't implemented such checks before, I'd like to ask more experienced users whether Nagios alone is suitable for monitoring externally submitted checks with such a short interval.
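A minimal sketch of the kind of once-a-second submitter described above, assuming the standard external command line format `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output`. The function names and the command file path in the usage note are illustrative assumptions, not taken from the actual setup.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file.

    return_code follows the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL.
    """
    ts = int(time.time()) if now is None else int(now)
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


def submit(command_file, line):
    """Write a single command line to the command file (a named pipe)."""
    # Open, write one complete line, and close again for each result.
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

A caller would loop with `time.sleep(1)` and call `submit("/usr/local/nagios/var/rw/nagios.cmd", format_passive_result("node03", "NodeState", 0, "node03 up at ..."))`; the path shown is only an assumed location and has to match the `command_file` setting of the local installation.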
If the service is up, the Nagios log shows that it reads the status from its command file without any delay:

[1186719368] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719368
[1186719369] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719369
[1186719370] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719370
[1186719371] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719371
[1186719372] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719372

However, the service then goes to a critical state:

[1186719373] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719373

and from this moment on, external checks are read from the command file at 9-10 second intervals, with a "service alert" and a notification at the end of each burst of activity:

[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719374
[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719375
[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719376
[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719377
[1186719384] SERVICE ALERT: node03;NodeState;CRITICAL;HARD;1;node03 DOWN at 1186719373

Then the service goes up, and after a while I see the following log entries:

[1186719447] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;node03 up at 1186719447
[1186719447] Warning: The results of service 'NodeState' on host 'node03' are stale by 11 seconds (threshold=60 seconds). I'm forcing an immediate check of the service.

I have freshness checks enabled, and the service status is reported as stale, although it was reported as normal shortly before.
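The processing delay visible in the log excerpts above can be measured by comparing the bracketed log timestamp with the submission time embedded in the plugin output. The sketch below assumes the output always ends in "at <epoch seconds>", as it does in these excerpts; the function name and the regular expression are illustrative.

```python
import re

# Matches "[logged] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;... at submitted"
LINE_RE = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;.* at (\d+)$"
)


def processing_lag(log_line):
    """Return seconds between submission and logging, or None for other lines."""
    match = LINE_RE.match(log_line.strip())
    if match is None:
        return None
    logged, submitted = int(match.group(1)), int(match.group(2))
    return logged - submitted
```

Applied to the burst logged at 1186719384, this gives lags of 10, 9, 8 and 7 seconds for the four DOWN results, against 0 seconds for the results submitted while the service was up.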
As a result, I am seeing service notifications with wrong timestamps: the notifications appear at 18-second intervals, although the DOWN service checks are submitted at 1-second intervals. In addition, the service status is reported as stale after it has gone up. Is there a way to speed up the processing of CRITICAL service checks? I'd like to get a notification within the same second.

br,
risto