Antoine Musso wrote:
> Andreas Ericsson wrote:
>> Turn off OCHP and OCSP and then reload Nagios. If that doesn't help,
>> unload NDOUtils and then restart Nagios. If that helps, re-enable the
>> OCSP/OCHP commands again. If it's working then, it was NDOUtils' fault.
>> If not, it's the combined load of NDOUtils and the OC?P commands.
>>
>> OC?P commands add a rather extraordinary amount of load to the system
>> irrespective of how simple they are. Usually, you'd be better off
>> replacing them with an extremely simple NEB-module.
>
> Hello Andreas,
>
> Thanks for answering :) The OCHP/OCSP stuff did not help, but I found
> two causes for our high latencies:
>
> The first one is a long timeout on a synchronous check. When a service
> is non-OK, the main Nagios thread triggers a synchronous host check:
>
> [1223978320.237542] [016.1] [pid=10613] Service is in a non-OK state!
> [1223978320.237547] [016.1] [pid=10613] Host is currently UP, so we'll
> recheck its state to make sure...
> [1223978320.237694] [256.1] [pid=10613] Running command check_ping...
> [1223978330.251788] [256.1] [pid=10613] Execution time=10.009 sec
>
> While this plugin is executing, Nagios is just idling and the latency
> rises really fast. So I modified our check_host_alive command to
> time out after 3 seconds; we still have to find optimal parameters.
>
> The second issue is NDO. We have ndo2db listening on a database server
> on the same switch. ndomod sends everything (data_processing_options
> parameter set to -1) over a TCP connection.
>
> I analyzed the callback debugging messages (debug 64, verbosity 2) over
> a period of 938 seconds:
>
> AGGREGATED_STATUS_DATA (#25)    124 calls, avg: 0.00s (total   0.00s)
> LOG_DATA               (# 9)    366 calls, avg: 0.00s (total   0.56s)
> SERVICE_STATUS_DATA    (#20)   8321 calls, avg: 0.02s (total 149.48s)
> HOST_CHECK_DATA        (#14)   4933 calls, avg: 0.01s (total  50.23s)
> TIMED_EVENT_DATA       (# 8)  12282 calls, avg: 0.00s (total  26.26s)
> PROGRAM_STATUS_DATA    (#18)    177 calls, avg: 0.00s (total   0.76s)
> STATE_CHANGE_DATA      (#30)    195 calls, avg: 0.00s (total   0.45s)
> SERVICE_CHECK_DATA     (#13)  12189 calls, avg: 0.01s (total  82.74s)
> SYSTEM_COMMAND_DATA    (#10)   8584 calls, avg: 0.01s (total 102.84s)
> HOST_STATUS_DATA       (#19)   2544 calls, avg: 0.03s (total  64.67s)
>
> 49715 calls, 477 seconds: 50.85% of the time was spent sending NDO
> messages.
>
> The impact on latency is really bad. One of my colleagues filtered out
> some of those callbacks (data_processing_options set to 276673), and
> that seems to help :)
>
> This raises two new questions:
>
> 1/ is there any recommended setting for checking the liveness of a
> host ? Since this check is synchronous, we want Nagios to achieve this
> as fast as possible.
>
Upgrade to Nagios 3. It sports parallel host checks.

> 2/ is ndomod waiting for ndo2db to insert the data in the database ?

That I don't know, but Nagios waits for the NEB module to finish its call
before proceeding, so if ndomod does some uninterruptible I/O, you might
be in for a long wait. There's no way around that wait, so the only
solution is to make sure the NEB returns control to Nagios asap.

> If not, I am going to check why it takes so long to send a packet to the
> remote ndo2db instance.

That's a good idea. Please let us know what you find.

-- 
Andreas Ericsson [EMAIL PROTECTED]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

_______________________________________________
Nagios-users mailing list
Nagios-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null
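
For anyone who wants to redo the arithmetic on the ndomod callback timings quoted in this thread, here is a minimal Python sketch. The call counts and totals are copied from Antoine's summary; the names CALLBACKS, WINDOW_SECONDS and summarize are made up for illustration and are not part of Nagios or NDOUtils.

```python
# Per-callback figures copied from the ndomod debug summary in this thread:
# callback name -> (number of calls, total seconds spent in the callback)
CALLBACKS = {
    "AGGREGATED_STATUS_DATA": (124, 0.00),
    "LOG_DATA": (366, 0.56),
    "SERVICE_STATUS_DATA": (8321, 149.48),
    "HOST_CHECK_DATA": (4933, 50.23),
    "TIMED_EVENT_DATA": (12282, 26.26),
    "PROGRAM_STATUS_DATA": (177, 0.76),
    "STATE_CHANGE_DATA": (195, 0.45),
    "SERVICE_CHECK_DATA": (12189, 82.74),
    "SYSTEM_COMMAND_DATA": (8584, 102.84),
    "HOST_STATUS_DATA": (2544, 64.67),
}

WINDOW_SECONDS = 938  # length of the observation period


def summarize(callbacks, window):
    """Return (total calls, total seconds, percent of the window spent)."""
    total_calls = sum(calls for calls, _ in callbacks.values())
    total_time = sum(seconds for _, seconds in callbacks.values())
    share = 100.0 * total_time / window
    return total_calls, total_time, share


if __name__ == "__main__":
    calls, seconds, share = summarize(CALLBACKS, WINDOW_SECONDS)
    print(f"{calls} calls, {seconds:.2f}s, {share:.2f}% of the window")
    # List the callbacks by total time, most expensive first.
    ranked = sorted(CALLBACKS.items(), key=lambda kv: kv[1][1], reverse=True)
    for name, (n, total) in ranked:
        print(f"{name:<24} {n:>6} calls  {total:>7.2f}s")
```

Summing the rounded per-callback totals gives a shade under 478 seconds, so the share comes out at about 51% rather than the 50.85% quoted, which corresponds to 477 seconds. Either way, the status and system-command callbacks account for most of the time.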
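
On the data_processing_options value of 276673 mentioned in this thread: assuming the option is a bitmask in which each bit enables one class of events (with -1 meaning everything), the value can be split into its individual flags as sketched below. The helper name set_bits is hypothetical, and the mapping from each flag to an event type is not given in this thread, so it is left out; the comments in the sample ndomod.cfg are the place to look that up.

```python
def set_bits(mask):
    """Return the power-of-two flags that are set in a non-negative mask."""
    if mask < 0:
        raise ValueError("negative masks (such as -1 for 'everything') are not decoded")
    flags = []
    bit = 1
    while bit <= mask:
        if mask & bit:
            flags.append(bit)
        bit <<= 1
    return flags


if __name__ == "__main__":
    # 276673 is the filtered value quoted in this thread.
    for flag in set_bits(276673):
        print(flag)
```

Seven flags come out, so under this reading only seven classes of events are still sent to ndo2db instead of all of them.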