Re: alerts functionality

2004-11-23 Thread Jim Trocki
On Fri, 19 Nov 2004, Joubin Moshrefzadeh wrote:

> host1 goes down - 1 alert sent
> then host2 goes down - 2 alerts sent
> then host3 goes down - 3 alerts sent
> etc...
> 
> so total alerts sent is 1+2+3...+10?
> 
> is the latter correct? I've only tested it up to two hosts going down 
> consecutively :)

it's correct depending on how you configure mon. this is the default
behavior, but you can change it.

i noticed the man page needed some updating, so i updated it and checked in
the changes to the cvs tree on the mon-1-0-0pre1 branch. the part which affects
this behavior is the "alertevery" parameter.  here's a summary:


ALERT DECISION LOGIC
   Upon a non-zero or zero exit status, the associated alert or upalert
   program (respectively) is started, pending the following conditions: If
   an alert for a specific service is disabled, do not send an alert. If
   dep_behavior is set to 'a', and a parent dependency is failing, then
   suppress the alert. If the alert has previously been acknowledged, do
   not send the alert, unless it is an upalert. If an alert is not within
   the specified period, record the failure via syslog(3) and do not send
   an alert. If the failure does not fall within a defined period, do not
   send an alert. No upalerts are sent without corresponding down alerts,
   unless no_comp_alerts is defined in the period section. An upalert will
   only be sent if the previous state is a failure.

   If an alert was already sent within the last alertevery interval and
   the monitor has continued to report a nonzero exit status for a time
   period less than that interval, do not send another alert, unless the
   summary output from the most recent monitor process differs from the
   previous. Otherwise, send an alert using each alert program listed for
   that period. The observe_detail argument to alertevery affects this
   behavior by observing the changes in the detail part of the output in
   addition to the summary line.

   If a monitor has successive failures and the summary output changes in
   each of them, alertevery will not suppress multiple consecutive alerts.
   The reasoning is that if the summary output changes, then a significant
   event occurred and the user should be alerted. The "ignore_summary"
   option will suppress all successive alerts while the service continues
   to fail, even if the summary output changes. If the "strict" alertevery
   option is used, then behave the same as if "ignore_summary" was set, but
   do not reset the alertevery timer when the monitor exits with a zero
   status. For example, "alertevery 24h strict" will only send out an alert
   once every 24 hours, regardless of whether the monitor output changes,
   or if the service stops and then starts failing.

...

   alertevery timeval [observe_detail | ignore_summary | strict]
       The alertevery keyword (within a period definition) takes the
       same type of argument as the interval variable, and limits the
       number of times an alert is sent when the service continues to
       fail. For example, if the interval is "1h", then the alerts in
       the period section will only be triggered once every hour as the
       service continues to fail. The alertevery interval timer will be
       reset if the monitor stops exiting with a nonzero exit status
       (i.e. it reports a success). If the alertevery keyword is omitted
       in a period entry, an alert will be sent out every time a failure
       is detected. By default, if the summary output of two successive
       failures changes, then the alertevery interval is overridden, and
       an alert will be sent. The "ignore_summary" argument suppresses
       this behavior. If the string "observe_detail" is the last
       argument, then both the summary and detail output lines will be
       considered when comparing the output of successive failures. If
       the string "strict" is the last argument, then the output of the
       monitor or the state change of the service will have no effect on
       when alerts are sent. That is, "alertevery 24h strict" will send
       only one alert every 24 hours, no matter what. Please refer to
       the ALERT DECISION LOGIC section for a detailed explanation of
       how alerts are suppressed.
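
The alertevery rules quoted above can be sketched as a small decision
function. This is only a simplified model written in Python for
illustration, not mon's actual Perl code. The names AlertState,
should_alert and on_success are hypothetical. Two details are assumptions
that the man page text does not spell out: the timer restarts whenever an
alert is sent, and a success clears the remembered output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical per-period bookkeeping; these field names are not mon's."""
    last_alert_time: Optional[float] = None  # when the last alert was sent
    last_summary: Optional[str] = None       # summary line of previous failure
    last_detail: Optional[str] = None        # detail text of previous failure


def should_alert(state, now, interval, summary, detail="", mode="default"):
    """Return True if a failure at time `now` should trigger an alert.

    mode is one of "default", "observe_detail", "ignore_summary", "strict".
    """
    within = (state.last_alert_time is not None
              and now - state.last_alert_time < interval)
    if mode == "default":
        changed = summary != state.last_summary
    elif mode == "observe_detail":
        changed = (summary, detail) != (state.last_summary, state.last_detail)
    else:
        # ignore_summary and strict never look at the monitor output.
        changed = False
    send = (not within) or changed
    state.last_summary, state.last_detail = summary, detail
    if send:
        # Assumption: sending an alert restarts the alertevery timer.
        state.last_alert_time = now
    return send


def on_success(state, mode="default"):
    """A zero exit status resets the alertevery timer, except under strict."""
    if mode != "strict":
        state.last_alert_time = None
    # Assumption: the remembered output is forgotten after a success.
    state.last_summary = state.last_detail = None
```

With an interval of 3600 seconds in the default mode, a second failure with
the same summary inside the hour is suppressed, while a failure whose
summary has changed is sent immediately.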


___
mon mailing list
[EMAIL PROTECTED]
http://linux.kernel.org/mailman/listinfo/mon


Re: alerts functionality

2004-11-23 Thread David Nolan

--On Monday, November 22, 2004 1:45 PM -0500 Jim Trocki 
<[EMAIL PROTECTED]> wrote:

>> so total alerts sent is 1+2+3...+10?
>>
>> is the latter correct? I've only tested it up to two hosts going down
>> consecutively :)
>
> it's correct depending on how you configure mon. this is the default
> behavior, but you can change it.

Also, it should be pointed out that this is entirely dependent on the 
behavior of the monitor script.  If the script outputs a different summary, 
then Mon will alert again (unless configured not to).  Most scripts output 
the list of failing hosts as the summary, but not all.
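
To illustrate the point about summaries, here is a rough Python sketch that
counts alerts for ten hosts failing one after another inside a single
alertevery interval. It is a simplified model, not mon itself: the two
monitors and the function count_alerts are hypothetical, and the only rule
modelled is "alert again when the summary line changes".

```python
def count_alerts(summaries, ignore_summary=False):
    """Count alerts for consecutive failed runs inside one alertevery interval.

    Simplified model: the first failure always alerts; later failures alert
    only when the summary line changed and ignore_summary is not set.
    """
    alerts = 0
    previous = None
    for i, summary in enumerate(summaries):
        if i == 0 or (summary != previous and not ignore_summary):
            alerts += 1
        previous = summary
    return alerts


hosts = [f"host{n}" for n in range(1, 11)]

# Monitor A (hypothetical): summary lists every host that is currently down.
listing = [" ".join(hosts[:k]) for k in range(1, len(hosts) + 1)]

# Monitor B (hypothetical): summary is the same fixed string on every failure.
fixed = ["hosts unreachable"] * len(hosts)

print(count_alerts(listing))                       # 10
print(count_alerts(listing, ignore_summary=True))  # 1
print(count_alerts(fixed))                         # 1
```

A monitor that lists the failing hosts produces a new alert for each
additional host, while a monitor with a constant summary, or a period using
ignore_summary, produces only the first one.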

-David
David Nolan<*>[EMAIL PROTECTED]
curses: May you be forced to grep the termcap of an unclean yacc while
 a herd of rogue emacs fsck your troff and vgrind your pathalias!


Repeat sending of upalerts

2004-11-23 Thread Oliver Nyderle
Hi ...
I've set up a mon server which receives traps from some client mon
servers (sent via remote.alert).

After restarting the main mon server, all services monitored by the
client mon servers are in an unknown state until I restart the client
server or a service on a client server fails.

Is there a way to repeat sending upalerts to inform the main server
that everything is ok and not unknown? Any other hints for getting the
correct status at the main mon server without restarting mon on every
client?

Thanks for any hints ...
Regards
Oliver
--
 Oliver Nyderle <[EMAIL PROTECTED]>
 ANDURAS service solutions AG
 Innstraße 71 - 94036 Passau - Germany
 Web: www.anduras.de - Tel: +49 (0)851-4 90 50-0 - Fax: +49 (0)851-4 90 50-55

 Legal form: Aktiengesellschaft (stock corporation) - Registered office: Passau - Local court (Amtsgericht) Passau, HRB 6032
 Executive board members: Sven Anders, Marcus Junker, Michael Schön
 Chairman of the supervisory board: Dipl. Kfm. Karlheinz Antesberger



RE: alerts functionality

2004-11-23 Thread Gary Richardson
As a side note to this conversation, we do our configuration a wee bit
differently.

We require the ability to get stats, disable services and get alerts
specific to each host. Basically, we need everything to happen at the host
level.

In order to accomplish this, we map one host group to one host. The bonus to
this configuration is that we always know specifically which service on
which host is down when we receive a page. The pages we get look like
$host/$server - $short_summary\n$date.

If http on server a fails, we get one page plus pages every X minutes. If
server b's http fails, we get one page plus pages every X minutes.
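
A one-hostgroup-per-host layout like this is tedious to write by hand, so
it is often generated. The Python sketch below renders mon.cf-style stanzas
for a list of hosts. It is only an illustration under assumptions: the
function per_host_config, the group naming rule, the host names and the
interval values are all hypothetical, and the generated text should be
checked against the mon man page before use.

```python
def per_host_config(hosts, service, monitor, alert, interval="5m", every="15m"):
    """Render one hostgroup and one watch stanza per host as mon.cf-style text."""
    stanzas = []
    for host in hosts:
        group = host.replace(".", "_")  # hypothetical naming rule for the group
        stanzas.append(
            f"hostgroup {group} {host}\n"
            f"\n"
            f"watch {group}\n"
            f"    service {service}\n"
            f"        interval {interval}\n"
            f"        monitor {monitor}\n"
            f"        period wd {{Sun-Sat}}\n"
            f"            alert {alert}\n"
            f"            alertevery {every}\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Placeholder hosts and alert address, for illustration only.
    print(per_host_config(
        ["a.example.com", "b.example.com"],
        service="http",
        monitor="http.monitor",
        alert="mail.alert ops@example.com",
    ))
```

Because every host gets its own watch entry, each host also gets its own
alertevery timer, which matches the "one page plus pages every X minutes"
behavior described above.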

The negatives, of course, are that we have a lot of perl forks happening on
our monitoring servers. 

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> On Behalf Of David Nolan
> Sent: Tuesday, November 23, 2004 7:31 AM
> To: mon_list
> Subject: Re: alerts functionality
> 
> 
> 
> --On Monday, November 22, 2004 1:45 PM -0500 Jim Trocki
> <[EMAIL PROTECTED]> wrote:
> 
> >> so total alerts sent is 1+2+3...+10?
> >>
> >> is the latter correct? I've only tested it up to two hosts going down
> >> consecutively :)
> >
> > it's correct depending on how you configure mon. this is the default
> > behavior, but you can change it.
> 
> Also, it should be pointed out that this is entirely dependent on the
> behavior of the monitor script.  If the script outputs a different
> summary,
> then Mon will alert again (unless configured not to).  Most scripts output
> the list of failing hosts as the summary, but not all.
> 
> -David
> 
> David Nolan<*>[EMAIL PROTECTED]
> curses: May you be forced to grep the termcap of an unclean yacc while
>   a herd of rogue emacs fsck your troff and vgrind your pathalias!
> 


Re: Repeat sending of upalerts

2004-11-23 Thread David Nolan

--On Tuesday, November 23, 2004 5:01 PM +0100 Oliver Nyderle 
<[EMAIL PROTECTED]> wrote:

> Hi ...
> I've set up a mon server which receives traps from some client mon
> servers (sent via remote.alert).
>
> After restarting the main mon server, all services monitored by the
> client mon servers are in an unknown state until I restart the client
> server or a service on a client server fails.
>
> Is there a way to repeat sending upalerts to inform the main server
> that everything is ok and not unknown? Any other hints for getting the
> correct status at the main mon server without restarting mon on every
> client?
>
> Thanks for any hints ...

I strongly recommend that you look at the current version of Mon in CVS. 
One of the new features is a mon config option called 'redistribute', which 
takes the name of an alert script and calls that alert script on *every* 
status update.  This is to allow you to redistribute complete status 
information to remote mon servers, or even into other systems.  We use it 
with a script that sends traps to other mon servers, to provide the 
complete status information to all our servers.

I actually just committed some updates to the CVS HEAD last week, pulling
up some changes that Jim has made in the Mon 1.0 CVS branch.  I believe I'm
about ready to tag this as mon-1.1pre and release a tarball.

-David Nolan
Network Software Designer
Computing Services
Carnegie Mellon University


Re: alerts functionality

2004-11-23 Thread Joubin Moshrefzadeh

> it's correct depending on how you configure mon. this is the default
> behavior, but you can change it.
> 
> ...

Thanks for the clarification Jim. 



RE: alerts functionality

2004-11-23 Thread David Nolan

--On Tuesday, November 23, 2004 8:47 AM -0800 Gary Richardson 
<[EMAIL PROTECTED]> wrote:

> The negatives, of course, are that we have a lot of perl forks happening
> on our monitoring servers.

My servers already fork 500K times a day...  That would be *so* much worse.
If only we had per-host status tracking.  (It's a big project... but it
really needs to be done.  I wonder if I'll ever find the time to do it.)

-David Nolan
Network Software Designer
Computing Services
Carnegie Mellon University


Alert escalation even if acknowledged

2004-11-23 Thread Michael Vogt
(Sorry if this is a duplicate. Resent after 17 hours without confirmation.)

I'm trying to meet a request to have mon send an alert first to the on
call person, and then if the problem is not resolved within a certain
time, T2, to a higher level person.  The sticky part is that this
should occur even if the condition was ack-ed.

If not for the ignoring of the ack, I see that I could do this with a
second alert entry (under another period label:) using something like
"alertafter T2".

In order to ignore the ack, it looks like I would have to set up a
duplicate group:service with the same test but, again, "alertafter T2".

Alternatively, it looks like I could easily tweak the code to add a new
alert type, alertnoack, which acts like alert but disregards the ack
state (and maybe passes the alert plugin the ack state and comment). I
really don't want to go off on my own.  Does this option have any
merit?

An even more elaborate change would have acknowledgment levels and
something like an "alertbelowack num" clause that indicates what level
of ack squelches that alert (e.g. default is num=1).
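
The proposed acknowledgment levels can be sketched in a few lines of
Python. This models only the idea described above; "alertbelowack" is a
proposal in this message, not an existing mon option, and the function
alert_fires with its parameters is hypothetical. The delay is modelled as a
plain number of seconds of continuous failure.

```python
def alert_fires(failing_for, ack_level, alertafter=0, below_ack=1):
    """Decide whether one alert entry fires under the proposed semantics.

    failing_for: seconds the service has been failing
    ack_level:   0 if unacknowledged, otherwise the level of the ack
    alertafter:  seconds of continuous failure required before alerting
    below_ack:   the alert is squelched once ack_level reaches this value
    """
    if failing_for < alertafter:
        return False
    return ack_level < below_ack


T2 = 1800  # placeholder escalation delay in seconds

# On-call alert: default below_ack=1, so any ack squelches it.
print(alert_fires(600, ack_level=1))                                  # False

# Escalation alert: fires after T2 unless a level-2 ack has been given.
print(alert_fires(3600, ack_level=1, alertafter=T2, below_ack=2))     # True
print(alert_fires(3600, ack_level=2, alertafter=T2, below_ack=2))     # False
```

Under this model a level-1 ack from the on-call person silences the first
alert but not the escalation, which is the behavior requested.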


Are there any other solutions that don't involve configuring a
"duplicate" group:service or changing mon?

Thanks for any suggestions,

Michael Vogt

