RE: LONG Proposal: Making mon aware of individual host failures (WAS RE: n00b alert)

colm ennis Tue, 17 Sep 2002 03:57:05 -0700

hi all,

i congratulate david on his offer.


i can clearly see the need for per host status information being maintained
by mon, in particular when large hostgroups are involved.

i've found hostgroups useful for simplifying configurations and web
interfaces. i see hostgroups when used in conjunction with parallel
monitoring as one of mon's advantages over netsaint. the problem with using
the hostgroup model exclusively is that it doesn't scale. ...at all.

btw... as we are talking about large installations i'd like to request that
the saving of operational state code be completed. it's a little silly to be
generating alert storms on reconfigs.

best of luck,

colm

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of David Nolan
Sent: 16 September 2002 14:35
To: [EMAIL PROTECTED]
Subject: LONG Proposal: Making mon aware of individual host failures
(WAS RE: n00b alert)


One of my co-workers has been suggesting that we extend Mon to make it
actually be aware of which hosts in a hostgroup are failing.  I wasn't sure
that I saw the need for that work, but given the recent threads about
per-host behavior, I'm beginning to think that I'm in the minority.

So, I'm proposing the following changes to Mon.  I'll wait a day or two for
the discussion to come in, and then assuming the consensus is 'go', I'll
make the changes to Mon itself, and the monitors that come with the base
package, and I'll send the patches to Jim to be included in the next
release.


Proposal:
Mon needs to be aware of the status of individual hosts.  Reasons for this
include:

Per-host dependencies: inability to ping one server shouldn't necessarily
cause us not to care about all services on the rest of the members of the
hostgroup

Correct alert behavior, depending on the situation: In some cases, like
automatically modifying DNS records, you want an alert or upalert on ANY
host changing status.  In other cases (pagers) you probably only care about
'a new host failed' or 'everything is working'.    Right now we have
something resembling this behavior by checking for a modified summary.  But
this is suboptimal, because we alert when going from two hosts failing to
one host failing, and if the summary contains (for example) the CPU load of
a machine whose load is too high, we might re-alert on every monitor run.

Accurate downtime logs:  Right now if hosts A, B, and C are all in the same
hostgroup, and A fails, followed by B and C failing, the downtime log will
only show A.


Proposed Strategy:
In order to solve this problem, the monitor scripts will have to pass more
data back to Mon.  However any changes to Mon and the scripts should
probably try to be backwards compatible.

I suggest we add a new optional service level mon.cfg option,
'monitortype'.  If 'monitortype' is set to 'host-extended' then we treat
the output from that monitor differently.   If it is set to 'legacy', or
not set, we expect the current behavior, and some of the new functionality
may not be precisely correct.

New style monitor scripts need to output extra data.   I have two ideas for
how to do this.  The first is to just output an extra line of data.
Instead of a one-line summary, and then a detail message, we output a
one-line summary, a one-line list of failed hosts, and then the detail
message.  This ensures maximum backwards compatibility, as these scripts
could just be used with older Mon servers with no problems.
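
As a rough illustration of the first idea, here is a minimal Python sketch
of how a server might split such output.  The function name and the
assumption that failed hosts are whitespace-separated on the second line
are hypothetical; the proposal does not fix a delimiter.

```python
def parse_host_extended(output):
    """Split 'host-extended' monitor output into its three parts.

    Hypothetical layout: line 1 is the summary, line 2 lists the failed
    hosts (assumed whitespace-separated), the rest is the detail message.
    """
    lines = output.splitlines()
    summary = lines[0] if lines else ""
    failed_hosts = lines[1].split() if len(lines) > 1 else []
    detail = "\n".join(lines[2:])
    return summary, failed_hosts, detail


sample = "2 hosts down\nweb1 web3\nweb1: connection refused\nweb3: timeout\n"
summary, failed_hosts, detail = parse_host_extended(sample)
```

An older server that only knows about summary plus detail would simply
treat the host list as the first line of the detail text.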

On the other hand, we could take the philosophy that we're really
outputting structured data, and we might want to add more structure at a
later point.  So in that case maybe we should consider XML.  So the output
might look something like:

<?xml version="1.0" standalone="yes"?>
<monitoroutput name="monitor name here">
  <summary>Summary Text Here</summary>
  <host status="fail" name="host.name.here">Optional reason here</host>
  <host status="pass" name="host.name.here">Optional status data here</host>
  <detail>Detailed monitor output here</detail>
</monitoroutput>

Now, I don't want to sound like an XML fan-boy here, so I'll admit that
right now, for the amount of extra data I'm talking about incorporating
into the output, the overhead of adding XML support seems a bit high.  On
the other hand, doing this now might make later extensions much easier.
And for Perl-based monitor scripts, there exist several modules which will
assist in creating the XML easily.  (Not that formatting something so
simple yourself is hard.)  I'm 50-50 on the issue, so I'll let the list
decide.  We could do both, by supporting 'monitortype host-extended' and
'monitortype xml', if people want both options.  That might be the best
option, but involves a little extra work.
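
To give a feel for the consuming side of the XML variant, here is a small
Python sketch using the standard library's xml.etree.ElementTree.  The
element and attribute names follow the example output in this message; the
function names, the monitor name and the host data are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical monitor output in the XML layout proposed in this message.
SAMPLE = """<?xml version="1.0" standalone="yes"?>
<monitoroutput name="ping.monitor">
  <summary>web2</summary>
  <host status="fail" name="web2">no reply</host>
  <host status="pass" name="web1">rtt 3ms</host>
  <detail>web2 did not answer 5 pings</detail>
</monitoroutput>
"""


def parse_monitor_xml(text):
    """Return the monitor name, summary, per-host status and detail."""
    root = ET.fromstring(text)
    hosts = {
        host.get("name"): (host.get("status"), host.text or "")
        for host in root.findall("host")
    }
    return {
        "name": root.get("name"),
        "summary": root.findtext("summary", default=""),
        "hosts": hosts,
        "detail": root.findtext("detail", default=""),
    }


def failing_hosts(parsed):
    """Sorted names of hosts whose status attribute is 'fail'."""
    return sorted(
        name for name, (status, _) in parsed["hosts"].items() if status == "fail"
    )


result = parse_monitor_xml(SAMPLE)
```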


Moving on to "what to do with per-host data" once we're actually getting it
from the monitor scripts:
Individual hosts going down should be logged to the downtime log.

Per host statuses should be passed to mon clients when they ask for
opstatus information.  (mon.cgi could use this to highlight the rows of
hosts which are failing on the hostgroup page.)

It should be possible to alert based on per-host data.  I would suggest
adding 'hostalert' and 'hostupalert' as per-period config options, and
calling the alert script once for every host that changes state.  So if
hosts A and B both start failing at the same time, the hostalert script
would be called twice, once for A, once for B.  If A then comes back while
B is still down, the hostupalert would be called for A.  Then sometime
later when B is working again, we'll run the hostupalert for B.
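
The A/B sequence above amounts to comparing the set of failing hosts
between two monitor runs.  A minimal Python sketch, with a hypothetical
function name, might look like this:

```python
def host_transitions(previous_failed, current_failed):
    """Return (newly_failed, recovered) hosts between two monitor runs.

    Under the proposal, 'hostalert' would be called once per newly failed
    host and 'hostupalert' once per recovered host.
    """
    previous, current = set(previous_failed), set(current_failed)
    newly_failed = sorted(current - previous)
    recovered = sorted(previous - current)
    return newly_failed, recovered


# The scenario described above: A and B fail, A recovers, then B recovers.
step1 = host_transitions([], ["A", "B"])
step2 = host_transitions(["A", "B"], ["B"])
step3 = host_transitions(["B"], [])
```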

In conjunction with those changes, the regular alert semantics should change
slightly.  When alertevery is set, instead of re-alerting after any summary
change, by default we'll now re-alert only after the set of failing hosts
has gained a previously non-failing host.  (i.e. if we've already alerted
for A & B failing, and A starts working again, we shouldn't re-alert.
However if C starts failing we *should* re-alert.)   In order to support
the old behavior an 'observe_summary' option should be added to
'alertevery', similar to the current 'observe_detail'.
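
One way to read the proposed re-alert rule is as a set comparison.  The
Python sketch below is only that reading: the function name is
hypothetical, and it assumes the comparison is made against the set of
hosts that were failing when we last alerted.

```python
def should_realert(previously_failing, now_failing):
    """True only if a host is failing now that was not failing before.

    Assumption: 'previously_failing' is the failing set at the last alert.
    A host dropping out of the set (recovery) never triggers a re-alert.
    """
    return bool(set(now_failing) - set(previously_failing))


# Already alerted for A and B; A recovers: no re-alert.
a_recovers = should_realert({"A", "B"}, {"B"})
# Already alerted for A and B; C starts failing: re-alert.
c_fails = should_realert({"A", "B"}, {"A", "B", "C"})
```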

Finally, supporting true per-host dependencies should be possible.  I've
already actually done most of this work in my current source tree, but I'll
re-work it a bit to match the new model.  We'd like to have true per-host
dependencies (one machine from hostgroup X being unpingable should not
cause us to alert for other services failing on X, but should also not
cause us to ignore failures of the other services on other machines).
Think of this as the 'm' dependency behavior we have now, but on a per-host
basis.  The implementation involves generating a list of failed hosts in
all the direct dependencies of a hostgroup, and if any of those hosts match
a host in this hostgroup, not passing that host to the monitor.  In order
to support both this behavior and alert/monitor suppression behavior on the
same hostgroup, we need to split up the dependencies.  (An example of
wanting both is having per-host dependencies on multiple services on the
same hostgroup (http depends on ping), but also have alert suppression
based on other hostgroups/services (http & ping on the web group depends on
the appropriate router group being up))
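
The per-host exclusion step described above could be sketched in Python
roughly as follows.  The function name and data shapes are hypothetical;
the real change would live inside Mon itself.

```python
def hosts_to_monitor(hostgroup, dependency_failures):
    """Drop hosts that are already failing in any direct dependency.

    hostgroup: ordered list of host names in this hostgroup.
    dependency_failures: one collection of failed host names per direct
    dependency (for example, the failed hosts of SELF:ping).
    """
    excluded = set()
    for failed in dependency_failures:
        excluded.update(failed)
    # Hosts failing only in unrelated groups do not match and are ignored.
    return [host for host in hostgroup if host not in excluded]


remaining = hosts_to_monitor(
    ["web1", "web2", "web3"],
    [["web2"], ["router1"]],
)
```

Here web2 is skipped because it already fails a dependency, while web1 and
web3 are still passed to the monitor.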

To split up the dependencies, my current model is to make it three
different config options, 'alertdepend', 'monitordepend' and 'hostdepend'.
The old 'depend' and 'dep_behavior' options will still be supported, with a
new dep_behavior 'hm' available to mean per-host monitor excludes.  So for
my example above, on webservers:http you might have:
hostdepend SELF:ping
alertdepend machine-room-router:ping

OR you could have
depend machine-room-router:ping
dep_behavior a
hostdepend SELF:ping




That's the proposal.  Discuss.  As I said, I'm willing to do most of the
work, I just want to give other people the chance to chime in with
suggestions for how to do it.


-David Nolan
 Network Software Developer
 Computing Services
 Carnegie Mellon University

_______________________________________________
mon mailing list
[EMAIL PROTECTED]
http://linux.kernel.org/mailman/listinfo/mon
