Hi all, I congratulate David on his offer.
I can clearly see the need for per-host status information being maintained by mon, in particular when large hostgroups are involved. I've found hostgroups useful for simplifying configurations and web interfaces. I see hostgroups, when used in conjunction with parallel monitoring, as one of mon's advantages over netsaint. The problem with using the hostgroup model exclusively is that it doesn't scale. ...at all.

BTW... as we are talking about large installations, I'd like to request that the code for saving operational state be completed. It's a little silly to be generating alert storms on reconfigs.

Best of luck,
Colm

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of David Nolan
Sent: 16 September 2002 14:35
To: [EMAIL PROTECTED]
Subject: LONG Proposal: Making mon aware of individual host failures (WAS RE: n00b alert)

One of my co-workers has been suggesting that we extend Mon to make it actually be aware of which hosts in a hostgroup are failing. I wasn't sure that I saw the need for that work, but given the recent threads about per-host behavior, I'm beginning to think that I'm in the minority.

So, I'm proposing the following changes to Mon. I'll wait a day or two for the discussion to come in, and then, assuming the consensus is 'go', I'll make the changes to Mon itself and to the monitors that come with the base package, and I'll send the patches to Jim to be included in the next release.

Proposal:

Mon needs to be aware of the status of individual hosts. Reasons for this include:

Per-host dependencies: inability to ping one server shouldn't necessarily cause us not to care about all services on the rest of the members of the hostgroup.

Correct alert behavior, depending on the situation: In some cases, like automatically modifying DNS records, you want an alert or upalert on ANY host changing status. In other cases (pagers) you probably only care about 'a new host failed' or 'everything is working'.
Right now we have something resembling this behavior by checking for a modified summary. But this is suboptimal, because we alert when going from two hosts failing to one host failing, and if the summary contains (for example) the CPU load of a machine whose load is too high, we might re-alert on every monitor run.

Accurate downtime logs: Right now if hosts A, B, and C are all in the same hostgroup, and A fails, followed by B and C failing, the downtime log will only show A.

Proposed Strategy:

In order to solve this problem, the monitor scripts will have to pass more data back to Mon. However, any changes to Mon and the scripts should try to be backwards compatible. I suggest we add a new optional service-level mon.cfg option, 'monitortype'. If 'monitortype' is set to 'host-extended' then we treat the output from that monitor differently. If it is set to 'legacy', or not set, we expect the current behavior, and some of the new functionality may not be precisely correct.

New-style monitor scripts need to output extra data. I have two ideas for how to do this.

The first is to just output an extra line of data. Instead of a one-line summary and then a detail message, we output a one-line summary, a one-line list of failed hosts, and then the detail message. This ensures maximum backwards compatibility, as these scripts could just be used with older Mon servers with no problems.

On the other hand, we could take the philosophy that we're really outputting structured data, and we might want to add more structure at a later point. So in that case maybe we should consider XML.
So the output might look something like:

    <?xml version="1.0" standalone="yes"?>
    <monitoroutput name="monitor name here">
      <summary>Summary Text Here</summary>
      <host status="fail" name="host.name.here">Optional reason here</host>
      <host status="pass" name="host.name.here">Optional status data here</host>
      <detail>Detailed monitor output here</detail>
    </monitoroutput>

Now, I don't want to sound like an XML fan-boy here, so I'll admit that right now, for the amount of extra data I'm talking about incorporating into the output, the overhead of adding XML support seems a bit high. On the other hand, doing this now might make later extensions much easier. And for Perl-based monitor scripts, there exist several modules which will assist in creating the XML easily. (Not that formatting something so simple yourself is hard.) I'm 50-50 on the issue, so I'll let the list decide.

We could do both, by supporting 'monitortype host-extended' and 'monitortype xml', if people want both options. That might be the best option, but involves a little extra work.

Moving on to "what to do with per-host data" once we're actually getting it from the monitor scripts:

Individual hosts going down should be logged to the downtime log.

Per-host statuses should be passed to mon clients when they ask for opstatus information. (mon.cgi could use this to highlight the rows of hosts which are failing on the hostgroup page.)

It should be possible to alert based on per-host data. I would suggest adding 'hostalert' and 'hostupalert' as per-period config options, and calling the alert script once for every host that changes state. So if hosts A and B both start failing at the same time, the hostalert script would be called twice, once for A and once for B. If A then comes back while B is still down, the hostupalert would be called for A. Then sometime later, when B is working again, we'll run the hostupalert for B.

In conjunction with those changes, the regular alert semantics should change slightly.
When alertevery is set, instead of re-alerting after any summary change, by default we'll now re-alert only after the set of failing hosts has gained a previously non-failing host. (i.e. if we've already alerted for A & B failing, and A starts working again, we shouldn't re-alert. However, if C starts failing we *should* re-alert.) In order to support the old behavior, an 'observe_summary' option should be added to 'alertevery', similar to the current 'observe_detail'.

Finally, supporting true per-host dependencies should be possible. I've already done most of this work in my current source tree, but I'll re-work it a bit to match the new model. We'd like to have true per-host dependencies: one machine from hostgroup X being unpingable should not cause us to alert for other services failing on that machine, but should also not cause us to ignore failures of the other services on other machines. Think of this as the 'm' dependency behavior we have now, but on a per-host basis.

The implementation involves generating a list of failed hosts in all the direct dependencies of a hostgroup, and if any of those hosts match a host in this hostgroup, not passing that host to the monitor.

In order to support both this behavior and alert/monitor suppression behavior on the same hostgroup, we need to split up the dependencies. (An example of wanting both is having per-host dependencies on multiple services on the same hostgroup (http depends on ping), but also having alert suppression based on other hostgroups/services (http & ping on the web group depend on the appropriate router group being up).)

To split up the dependencies, my current model is to use three different config options: 'alertdepend', 'monitordepend' and 'hostdepend'. The old 'depend' and 'dep_behavior' options will still be supported, with a new dep_behavior 'hm' available to mean per-host monitor excludes.
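
As an illustration only, the two rules described above (re-alerting only when the failing set gains a new host, and withholding hosts that are already failing in a direct dependency) can be sketched in a few lines of Python. This is not Mon's code (Mon itself is written in Perl); the function names and the shape of the data passed in are hypothetical and chosen just for this sketch.

```python
def should_realert(already_alerted, now_failing):
    """Re-alert only if the failing set gained a host we have not alerted for.

    Both arguments are iterables of host names (hypothetical data shape).
    """
    return bool(set(now_failing) - set(already_alerted))


def hosts_to_monitor(hostgroup_hosts, failed_by_dependency):
    """Drop hosts that are failing in any direct dependency.

    hostgroup_hosts: ordered list of host names in this hostgroup.
    failed_by_dependency: dict mapping a dependency label such as
        "SELF:ping" to the hosts currently failing there (hypothetical shape).
    """
    failed = set()
    for hosts in failed_by_dependency.values():
        failed.update(hosts)
    # Keep the original order so the monitor sees hosts as configured.
    return [h for h in hostgroup_hosts if h not in failed]
```

With this sketch, having alerted for A and B, `should_realert({"A", "B"}, {"B"})` is false (A recovered, nothing new failed), while `should_realert({"A", "B"}, {"B", "C"})` is true. Likewise, `hosts_to_monitor(["www1", "www2", "www3"], {"SELF:ping": ["www2"]})` leaves only www1 and www3 to be handed to the http monitor.
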
So for my example above, on webservers:http you might have:

    hostdepend SELF:ping
    alertdepend machine-room-router:ping

OR you could have

    depend machine-room-router:ping
    dep_behavior a
    hostdepend SELF:ping

That's the proposal. Discuss. As I said, I'm willing to do most of the work; I just want to give other people the chance to chime in with suggestions for how to do it.

-David Nolan
Network Software Developer
Computing Services
Carnegie Mellon University

_______________________________________________
mon mailing list
[EMAIL PROTECTED]
http://linux.kernel.org/mailman/listinfo/mon