RE: LONG Proposal: Making mon aware of individual host failures(WAS RE: n00b alert)

David Nolan Tue, 17 Sep 2002 11:17:28 -0700

--On Tuesday, September 17, 2002 8:05 AM -0700 Jim Trocki 
<[EMAIL PROTECTED]> wrote:

> does it recognize when individual service parameters have changed after
> a reload, and reset the state of those?
>

I'm trying to do the 'right' thing as much as possible.  Which means I'm 
only loading 'state' information, not 'config' information.  What that 
really translates to is I'm saving and loading (per hostgroup/service):
op_status
failure_count
alert_count
last_success
consec_failures
last_failure
first_failure
last_summary
last_detail
ack
ack_comment
last_trap
last_traphost  (something I've added, the IP of the host the trap was 
received from.)
exitval
last_check
last_op_status

and for each period:
last_alert
alert_sent
1stfailtime
failcount


If the relevant hostgroup/service/period doesn't exist in the reloaded 
config, I ignore the saved data.
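
To make that concrete, here is a rough sketch of the restore step. It is written in Python purely for illustration (mon itself is Perl), and the function name and the nested-dictionary layout are hypothetical, not mon's real internals. It copies only the 'state' fields listed above into a freshly loaded config and skips anything whose hostgroup/service/period has disappeared.

```python
# Fields treated as 'state' (restored); everything else is 'config'
# and always comes from the reloaded config file.
SERVICE_STATE = (
    "op_status", "failure_count", "alert_count", "last_success",
    "consec_failures", "last_failure", "first_failure", "last_summary",
    "last_detail", "ack", "ack_comment", "last_trap", "last_traphost",
    "exitval", "last_check", "last_op_status",
)
PERIOD_STATE = ("last_alert", "alert_sent", "1stfailtime", "failcount")


def restore_state(config, saved):
    """Copy saved state into a freshly loaded config, in place.

    Both arguments use a hypothetical layout:
    hostgroup -> service -> fields, with a "periods" dict per service.
    """
    for group, services in saved.items():
        if group not in config:
            continue  # hostgroup was removed from the config
        for service, old in services.items():
            new = config[group].get(service)
            if new is None:
                continue  # service was removed from the config
            for key in SERVICE_STATE:
                if key in old:
                    new[key] = old[key]
            for period, old_period in old.get("periods", {}).items():
                new_period = new.get("periods", {}).get(period)
                if new_period is None:
                    continue  # period was removed from the config
                for key in PERIOD_STATE:
                    if key in old_period:
                        new_period[key] = old_period[key]
    return config
```

Because only whitelisted keys are copied, a changed config parameter (an interval, say) always wins over whatever was saved.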

So far I'm pretty happy with the results.  I've seen a couple of strange 
behaviors, which I'm still trying to track down.

> regarding the per-host opstatus, lmb had written some code (Mon::Protocol)
> long ago which can be found in the Mon-0.11.tar.gz perl module. it's
> a start, at least. it defines methods for encoding per-host status
> (var=val) to be passed to/from the mon server.
>

I'm looking at that now, and I'm not really sure how it's meant to be used. 
The documentation is pretty light, and doesn't provide any real examples. 
Is it just supposed to be a way to make it easier to parse the data given 
to a client by 'list opstatus', etc, and easy to generate data in the same 
format?

Are you intending for monitor scripts to start sending extra data to mon 
over the Mon socket, instead of via STDOUT?  While some object to the 
STDOUT method, I think it's cleaner than the socket-based approach, and I 
see several potential problems with the socket-based approach.  (For 
example: Every monitor script will have to be aware of the name of the 
hostgroup/service it is being run on, and a slightly broken script might 
start changing the status of other services by accident.  That's just one 
simple example, I can come up with more...)

Are you particularly attached to that format for passing extra data between 
monitors and mon?  Personally I find the XML based format I posted 
yesterday much more palatable, but I'm flexible either way.

> it's not a small amount of work to add this capability, but it's not a
> massive amount of work, either.

It'll require touching code all over Mon, but I don't think it'll be a huge 
problem.  It'll probably take me a couple days to get it all done, and a 
couple more after that to do the first round of testing.

>
> keep in mind that mon was intended to monitor things other than just
> "hosts", so the mechanisms shouldn't force the "host" model people
> have been talking about. maybe it's just a matter of terminology.
>

I've just been using the same terminology that mon uses.  'hostgroups' and 
'hosts'.  In fact the Mon manpage doesn't reflect what you say the intent 
was.  The manpage says a hostgroup is 'A single host or list of hosts, 
specified as names or IP addresses'.  My impression has always been that 
putting things other than hosts in hostgroups is really just taking 
advantage of the fact that mon doesn't enforce that the hostnames must 
actually look like hostnames.  (i.e. mon allows any non-whitespace character 
in a 'hostname')  It seems like it's an unintentional "feature".
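
As a tiny illustration of how permissive that rule is, here it is as a check. This is Python for illustration only, the function name is made up, and it reflects my reading of the behavior rather than mon's actual parser.

```python
import re


def is_acceptable_member(name):
    """True if name is one or more non-whitespace characters.

    Hypothetical helper: models the 'any non-whitespace character'
    rule described above, nothing more.
    """
    return re.fullmatch(r"\S+", name) is not None
```

Under that rule, something like '/dev/sda1' is accepted as a hostgroup member just as readily as 'www1.example.com'.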

That said, I don't think that the behaviors I'm proposing will make any 
changes at all to that "feature".  Replace 'host' with 
'possibly-host-like-object' (or just 'object') in my post and everything 
still seems logical.  (per-possibly-host-like-object dependencies ?  :)




-David Nolan
 Network Software Developer
 Computing Services
 Carnegie Mellon University

_______________________________________________
mon mailing list
[EMAIL PROTECTED]
http://linux.kernel.org/mailman/listinfo/mon