Re: LONG Proposal: Making mon aware of individual host failures (WAS RE: n00b alert)

Andrew Ryan Tue, 17 Sep 2002 18:42:10 -0700


On balance I think this is very positive progress. Comments inline.

On Mon, 16 Sep 2002, David Nolan wrote:

> Proposal:
> Mon needs to be aware of the status of individual hosts.

Yes, agreed, this is long overdue, as long as the default behavior stays
the same as it is now.

> Accurate downtime logs:  Right now if hosts A, B, and C are all in the same
> hostgroup, and A fails, followed by B and C failing, the downtime log will
> only show A.

This is what I'm most interested in from this change. Can you share your
proposed new downtime log format, based on the new significance of hosts?

If you're also overhauling the downtime log format for content (which you
will have to), you should also generate downtime log events for server
shutdown and startup, and preferably for disabling/enabling as well.

>
>
> Proposed Strategy:
> In order to solve this problem, the monitor scripts will have to pass more
> data back to Mon.  However any changes to Mon and the scripts should
> probably try to be backwards compatible.

I won't quote Yoda here, but I will say that backwards compatibility is
pretty important, especially for a product in as wide use as mon.

Also be aware that if you present different monitor formats, you risk
forking monitor development. As the maintainer of the mon contribs, I'm
particularly sensitive to this. A possibility might be if the monitors
defaulted to the old style, and were patched to allow an '--extended'
mode, but that is a lot of work.

And keep in mind that most installations of mon have significant work
in custom monitors which they have not contributed back to the community
because of lack of widespread utility or licensing/management issues. So
even if you were to modify all the monitors which come with mon, and a
good portion of the popular contrib'ed monitors, that still leaves a lot
of monitors out there.

Since virtually all monitors use the summary line to output the list of
failed hosts, can't your patches assume that is the list of failed hosts
(rejecting output that doesn't look like a host in the hostgroup)? That's
what I did in dtquery, and it worked pretty well.
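
A minimal sketch of that summary-line approach, in Python rather than
mon's own Perl. The function name and the sample hostnames are
hypothetical, and it assumes the summary line separates hosts with
whitespace or commas. It is an illustration of the idea, not dtquery's
actual code.

```python
def failed_hosts_from_summary(summary, hostgroup_members):
    """Return the summary-line tokens that are members of the hostgroup.

    Tokens that do not match a known member (error text, counts, etc.)
    are rejected, so only plausible failed hosts are kept.
    """
    members = set(hostgroup_members)
    failed = []
    # Treat commas like whitespace, then split into candidate tokens.
    for token in summary.replace(",", " ").split():
        if token in members and token not in failed:
            failed.append(token)
    return failed


hostgroup = ["hostA", "hostB", "hostC"]
print(failed_hosts_from_summary("hostA hostC timeout", hostgroup))
```

Legacy monitors would need no changes under this scheme; the cost is
that a summary line containing anything other than hostnames is only
partially understood.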

>
> I suggest we add a new optional service level mon.cfg option,
> 'monitortype'.  If 'monitortype' is set to 'host-extended' then we treat
> the output from that monitor differently.   If it is set to 'legacy', or
> not set, we expect the current behavior, and some of the new functionality
> may not be precisely correct.
>
> New style monitor scripts need to output extra data.   I have two ideas for
> how to do this.  The first is to just output an extra line of data.
> Instead of a one-line summary, and then a detail message, we output a
> one-line summary, a one-line list of failed hosts, and then the detail
> message.  This ensures maximum backwards compatibility, as these scripts
> could just be used with older Mon servers with no problems.
>
> On the other hand, we could take the philosophy that we're really
> outputting structured data, and we might want to add more structure at a
> later point.  So in that case maybe we should consider XML.  So the output
> might look something like:

I like XML as much as the next sysadmin, but the moment you commit to
using it in monitors and/or the server, you add a significant amount of
both knowledge and software prerequisites, and you create more
complicated dependencies. I believe it's important for Joe/Jane Sysadmin
to be able to get up and running with mon within an hour or two -- that's
why most of us started using mon.

I personally don't believe the case here is compelling enough. I think
whitespace and CR/LF can be used as effectively as they always have been
by mon.
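
To illustrate, the three-part output from the first idea in the proposal
(a summary line, one line of failed hosts, then the detail) can be parsed
with nothing more than line and whitespace splitting. This is a sketch in
Python, not mon's Perl; the function name and the sample output are made
up for the example.

```python
def parse_host_extended(output):
    """Split monitor output into (summary, failed_hosts, detail).

    Assumed layout: line 1 is the summary, line 2 is a whitespace-separated
    list of failed hosts, and everything after that is the detail message.
    """
    lines = output.splitlines()  # handles both LF and CR/LF line endings
    summary = lines[0] if lines else ""
    failed_hosts = lines[1].split() if len(lines) > 1 else []
    detail = "\n".join(lines[2:])
    return summary, failed_hosts, detail


sample = "2 hosts down\r\nhostA hostB\r\nhostA: timeout\r\nhostB: refused\r\n"
summary, failed_hosts, detail = parse_host_extended(sample)
print(summary)
print(failed_hosts)
print(detail)
```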

>
> Now, I don't want to sound like an XML fan-boy here, so I'll admit that
> right now, for the amount of extra data I'm talking about incorporating
> into the output, the overhead of adding XML support seems a bit high.  On
> the other hand, doing this now might make later extensions much easier.
> And for Perl-based monitor scripts, there exist several modules which will
> assist in creating the XML easily.  (Not that formatting something so
> simple yourself is hard.)

You do have to worry about making sure your detail output is properly
escaped. It's more work than most people would want to get into.
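
As a small illustration of the escaping problem: detail text routinely
contains characters such as '<', '>' and '&', which must be escaped
before being embedded in XML. The sketch below uses Python's standard
xml.sax.saxutils.escape. The 'detail' element name is hypothetical, since
the proposed XML format is not shown here.

```python
from xml.sax.saxutils import escape


def detail_element(detail_text):
    """Wrap free-form detail text in a (hypothetical) <detail> element."""
    # escape() replaces &, < and > with their XML entities.
    return "<detail>" + escape(detail_text) + "</detail>"


print(detail_element("load > 5 && disk <full>"))
```

Every monitor author emitting XML would have to remember a step like
this, in whatever language the monitor is written in.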

>  I'm 50-50 on the issue, so I'll let the list
> decide.  We could do both, by supporting 'monitortype host-extended' and
> 'monitortype xml', if people want both options.  That might be the best
> option, but involves a little extra work.

The danger I see here is that some percentage of people start writing
XML-output monitors, and some stick with ASCII. When people come to the
contrib archive, or download mon, and they find this variance in
monitors, they're going to be confused (rightfully so) and possibly peeved
(again rightfully so).

>
> Per host statuses should be passed to mon clients when they ask for
> opstatus information.  (mon.cgi could use this to highlight the rows of
> hosts which are failing on the hostgroup page.)

Yeah, this is excellent, it would allow us to present even more useful
information on the page in less real estate.

>
> It should be possible to alert based on per-host data.  I would suggest
> adding 'hostalert' and 'hostupalert' as per-period config options, and
> calling the alert script once for every host that changes state.

That's cool. I'd also recommend another parameter which would allow a
'grace period', so that hosts which fail in short succession could have
their alerts batched. This adds a fair amount of complexity, though, so
it might not be easy.
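
One possible shape for such a grace period, sketched in Python. The
function name, the event format (timestamp in seconds, hostname) and the
rule that a batch is measured from its first failure are all assumptions
made for the example, not anything mon implements.

```python
def batch_failures(events, grace_seconds):
    """Group (timestamp, host) failure events into batches.

    A batch collects every failure that arrives within grace_seconds of
    the first failure in that batch, so one alert can cover all of them.
    """
    batches = []
    batch_start = None
    for timestamp, host in sorted(events):
        # Start a new batch when the grace period of the current one is over.
        if batch_start is None or timestamp - batch_start > grace_seconds:
            batches.append([])
            batch_start = timestamp
        batches[-1].append(host)
    return batches


# A and B fail 10 seconds apart and share one alert; C fails much later.
print(batch_failures([(0, "A"), (10, "B"), (100, "C")], 30))
```

The complexity shows up in a real server rather than in this function:
the alert for the first host has to be delayed until the grace period
expires, which means tracking pending alerts and timers.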

> If hosts A and B both start failing at the same time, the hostalert script
> would be called twice, once for A, once for B.  If A then comes back while
> B is still down, the hostupalert would be called for A.  Then sometime
> later when B is working again, we'll run the hostupalert for B.
>
> In conjunction with those changes, the regular alert semantics should change
> slightly.  When alertevery is set, instead of re-alerting after any summary
> change, by default we'll now re-alert only after the set of failing hosts
> has gained a previously non-failing host.  (i.e. if we've already alerted
> for A & B failing, and A starts working again, we shouldn't re-alert.
> However if C starts failing we *should* re-alert.)   In order to support
> the old behavior an 'observe_summary' option should be added to
> 'alertevery', similar to the current 'observe_detail'.

Yeah, sounds good. What you call 'observe_summary' is just a hack to
achieve the same alert semantics you're talking about implementing.
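
The proposed re-alert rule comes down to a set comparison: re-alert only
if some host is failing now that was not covered by the last alert. A
sketch in Python, with hypothetical function and variable names:

```python
def should_realert(alerted_hosts, failing_hosts):
    """Return True only if a host is failing that was not already alerted on."""
    return bool(set(failing_hosts) - set(alerted_hosts))


# Already alerted for A and B; A recovers, so there is nothing new to report.
print(should_realert({"A", "B"}, {"B"}))
# C starts failing while B is still down, so a re-alert is warranted.
print(should_realert({"A", "B"}, {"B", "C"}))
```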

>
> Finally, supporting true per-host dependencies should be possible.

I don't think you're going to get too much argument from any mon user on
that :) Your examples and proposed implementation both look good here,
although I might have more questions once I see it in practice.

>
> That's the proposal.  Discuss.  As I said, I'm willing to do most of the
> work, I just want to give other people the chance to chime in with
> suggestions for how to do it.

Hey this is great. It's nice to see someone from a large environment using
mon and also addressing some of its longstanding scalability problems.

Maybe this will finally push mon over the edge to 1.0

:)

andrew

_______________________________________________
mon mailing list
[EMAIL PROTECTED]
http://linux.kernel.org/mailman/listinfo/mon
