Hi List! I am in the process of upgrading from v2.12 to v3.2.1. I am also taking the opportunity to move to a new server, which has allowed me to run both versions in tandem and compare how they behave.
One difference I noticed straight away was the downtime duration on certain hosts. For example, v2 would show a host as down for over 2 days, yet v3 would show the same host as down for only a few hours. On investigation, it turned out that the parent of the host on v3 had gone into a soft down state, which changed the host in question to an unreachable state. The parent host recovered within a minute or so and the host changed back to a down state, effectively resetting the down duration to zero. I would have expected the child host to change state only if the parent goes into a hard down state, not a soft down state.

I googled for the issue and found one related post from just over a year ago: http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg25543.html

The poster was given various suggestions to circumvent the problem, e.g. tweaking flap detection, increasing the time-out on the plugin etc., but nothing that seemed to resolve his issue. The poster's main problem with this behaviour was that the state changes caused down e-mail alerts for hosts that were already down. My issue is not with repeated alerts but with the accuracy of the down duration of the host. When our support department look to resolve host problems, they try to resolve the oldest problems first, for obvious reasons of fairness to our customers. This behaviour breaks that. In v3, to get an accurate downtime for a host, you now have to trawl through the alert history or run a trend report for the host to find out when it really went down.

Version 2 does not exhibit this problem. I don't think this is by design but purely down to the way serial host checks work in v2.
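To make the duration reset described above concrete, here is a rough Python sketch. It is not Nagios code and does not claim to reproduce how Nagios computes durations internally; the names Event, naive_duration and outage_duration and the timeline values are made up for illustration. It only models the behaviour I observed: a duration measured from the most recent state change, compared with one measured from when the host last left the up state.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: int   # minutes since the start of the (hypothetical) timeline
    state: str  # "UP", "DOWN" or "UNREACHABLE"


def naive_duration(events, now):
    """Minutes since the most recent state change of any kind."""
    return now - events[-1].time


def outage_duration(events, now):
    """Minutes since the host last left UP.

    DOWN <-> UNREACHABLE flips are treated as one continuous outage.
    """
    start = None
    for ev in events:
        if ev.state == "UP":
            start = None          # outage over, forget its start time
        elif start is None:
            start = ev.time       # first non-UP state begins the outage
    return 0 if start is None else now - start


# Hypothetical timeline for the child host:
events = [
    Event(0, "DOWN"),            # host really goes down
    Event(2880, "UNREACHABLE"),  # two days later the parent goes soft down
    Event(2881, "DOWN"),         # parent recovers a minute later
]

print(naive_duration(events, 3000))   # 119
print(outage_duration(events, 3000))  # 3000
```

The second figure is the one our support department would need in order to work on the oldest problems first.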
When a host goes into a soft down state in v2, Nagios cannot do anything else until it has completed all the retries or the host recovers, so it never gets the chance to mark the child host unreachable unless the parent check reaches max_check_attempts and Nagios determines that the parent host really is down.

The original poster of this problem made a good point: Nagios has all the tolerance built in to avoid false alarms on host checks, but unfortunately this logic doesn't carry through to child hosts. I can't see the current way v3 deals with parent/child problems as being desirable for most people, although it seems to have bothered only two of us!

Thoughts?

regards,
Aidan