failing more than once before alert

Ben Hartshorne Thu, 28 Jul 2005 15:06:52 -0700

Hi, all,

I have been getting an incredible number of false positive pages
recently.  I have to believe that it's something having to do with my
application, but most of the pages I get correct themselves one cycle
later.  I put in a test to hit google on port 80, and even that paged me
once in the middle of the night.


This pissed me off enough to do something about it.  Reading through the
list archives, I found this post:
http://lists.gnu.org/archive/html/monit-general/2005-04/msg00016.html
It gave me a nice idea (and I followed his example) but I really didn't
like the fact that after a single failure, the service requires human
intervention to restart monitoring (since the timeout function disables
monitoring for that service).  

So I started making code changes.  Unfortunately, I didn't do it the
*right* way, because it's been way too long since I played with flex
etc.  Instead, I took advantage of the "if x restarts in y cycles then
timeout," but eviscerated the ACTION_TIMEOUT functionality.  It no
longer actually times out, it just alerts.  

What I really wanted was "if x restarts in y cycles then alert," but I
couldn't figure out the right way to do it.

Since the timeout functionality was designed to start counting at the
first failure and to stop monitoring a service that actually times out,
the counter manipulation didn't work so well once timeouts could be
triggered and recovered often.

The end result:  I have a rule like:
set alert [EMAIL PROTECTED] {timeout}
check host RadixTest with address cryptio.net
        start program = "/bin/true"
        stop program = "/bin/true"
        if 2 restarts within 3 cycles then timeout
                if failed url http://cryptio.net/~ben/lilo.conf
                      and content == "default=linuxprep"
                      then restart

and I only get paged if it fails twice within three attempts.
(i.e.:
fail pass fail == page
fail pass pass == no page
fail fail pass == page)
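
As a rough illustration of the paging rule in that table, here is a
standalone sketch that keeps the last few check results in a ring buffer
and counts the failures among them.  It is not code from the patch and
not how monit does it; FailWindow, failwindow_init, failwindow_record
and MAX_WINDOW are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of monit: remembers the most recent
 * check results so failures can be counted over a sliding window. */
#define MAX_WINDOW 32

typedef struct {
    bool failed[MAX_WINDOW]; /* true = that check failed */
    int window;              /* how many recent checks to consider */
    int threshold;           /* failures in the window that trigger a page */
    int next;                /* slot the next result overwrites */
} FailWindow;

static void failwindow_init(FailWindow *w, int threshold, int window) {
    assert(window > 0 && window <= MAX_WINDOW);
    assert(threshold > 0 && threshold <= window);
    for (int i = 0; i < MAX_WINDOW; i++)
        w->failed[i] = false;
    w->window = window;
    w->threshold = threshold;
    w->next = 0;
}

/* Record one check result; return true while the page condition holds. */
static bool failwindow_record(FailWindow *w, bool check_failed) {
    w->failed[w->next] = check_failed;
    w->next = (w->next + 1) % w->window;

    int failures = 0;
    for (int i = 0; i < w->window; i++)
        if (w->failed[i])
            failures++;
    return failures >= w->threshold;
}
```

With threshold 2 and window 3 this reproduces the three cases listed
above.  It returns true on every check while the condition holds, so a
caller would want to page only when the result changes.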

I also made it decrement the failure counter slowly on passes, so that
fail pass fail pass fail == failure-page, recovery-page, failure-page
i.e. if it's recently failed, be more paranoid.

One annoyance is that the check_timeout function runs before the
service test instead of afterwards, so I actually get paged at the
beginning of the cycle following the failure condition.  I'm checking
every 60 seconds, so I can deal with that.  A correct solution wouldn't
exhibit this problem...  ;)
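
To make the one-cycle delay concrete, here is a toy model of the
counters as the patch below manipulates them.  It is only a sketch: the
field names follow the monit ones (nstart, ncycle, to_start, to_cycle),
but ToyService and the toy_* functions are hypothetical, and it assumes
that each failed test causes exactly one restart that bumps nstart by
one (that increment lives outside the patch).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int nstart, ncycle;     /* counters, named after the monit fields */
    int to_start, to_cycle; /* "if to_start restarts within to_cycle cycles" */
} ToyService;

/* Models the patched timeout check; returns true when it would alert. */
static bool toy_check_timeout(ToyService *s) {
    if (s->nstart > 0)
        s->ncycle++;
    /* keep both counters within their configured limits */
    if (s->ncycle > s->to_cycle)
        s->ncycle = s->to_cycle;
    if (s->nstart > s->to_start)
        s->nstart = s->to_start;
    if (s->nstart >= s->to_start && s->ncycle <= s->to_cycle)
        return true;
    if (s->nstart == 0)
        s->ncycle = 0;
    return false;
}

/* Models the service test that runs after the timeout check. */
static void toy_run_test(ToyService *s, bool check_failed) {
    if (check_failed) {
        s->nstart++; /* ASSUMPTION: one restart per failed test */
    } else {
        if (s->ncycle == s->to_cycle)
            s->nstart--;
        if (s->nstart < 0)
            s->nstart = 0;
    }
}

/* One monitoring cycle: timeout check first, then the test itself. */
static bool toy_cycle(ToyService *s, bool check_failed) {
    bool alert = toy_check_timeout(s);
    toy_run_test(s, check_failed);
    return alert;
}
```

Starting from `ToyService s = {0, 0, 2, 3}` and feeding it fail, pass,
fail, pass, the first alert comes back from the fourth call, i.e. one
cycle after the second failure.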

Another less-than-desirable trait: IMHO, the right way to do this
kind of thing is to use the leaky-bucket algorithm (as many network
protocols do), in which failures add up quickly but subside slowly.  You
would have to specify a rate at which the failure counter drops in
addition to the thresholds, e.g. 5 failures within 10 attempts, with the
failure counter decreasing by 1 for every 5 successes.

This allows a certain amount of flakiness, but alerts you quickly on a
hard failure, and alerts you if it gets too flaky.
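
A minimal sketch of that leaky-bucket idea, under a few assumptions:
every failure raises the level by one, every drain_every successes
lower it by one, and the level is capped at the alert threshold so that
recovery after a long outage is not delayed.  The drain rate stands in
for the "within N attempts" window.  LeakyBucket, bucket_init and
bucket_record are hypothetical names; nothing like this is in the patch.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int level;       /* current failure count */
    int threshold;   /* alert when level reaches this; also the cap */
    int drain_every; /* successes needed to lower level by one */
    int successes;   /* successes counted since the last drain */
} LeakyBucket;

static void bucket_init(LeakyBucket *b, int threshold, int drain_every) {
    assert(threshold > 0 && drain_every > 0);
    b->level = 0;
    b->threshold = threshold;
    b->drain_every = drain_every;
    b->successes = 0;
}

/* Record one check result; return true while the alert condition holds. */
static bool bucket_record(LeakyBucket *b, bool check_failed) {
    if (check_failed) {
        /* failures add up quickly: one step per failure, up to the cap */
        if (b->level < b->threshold)
            b->level++;
    } else if (b->level > 0 && ++b->successes >= b->drain_every) {
        /* successes subside slowly: one step per drain_every passes */
        b->level--;
        b->successes = 0;
    }
    return b->level >= b->threshold;
}
```

With threshold 5 and drain_every 5, five failures in a row alert
immediately, and a service that alternates fail and pass still alerts on
its fifth failure because four passes are not enough to drain a step.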

Anyway...   In case any of you are interested, I have attached a patch
with the modifications I made (against the head of the CVS tree).

-ben

P.S.  Is this the right list, or should I have posted this to
monit-general?  It seems much higher volume; all I see go by here
are announcements of newly checked-in files...

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net

Index: p.y
===================================================================
RCS file: /cvsroot/monit/monit/p.y,v
retrieving revision 1.208
diff -c -b -r1.208 p.y
*** p.y 3 Apr 2005 11:56:51 -0000       1.208
--- p.y 28 Jul 2005 20:51:00 -0000
***************
*** 1562,1568 ****
    addeventaction(&(current)->action_EXEC,     ACTION_ALERT,     ACTION_ALERT);
    addeventaction(&(current)->action_INVALID,  ACTION_RESTART,   ACTION_ALERT);
    addeventaction(&(current)->action_NONEXIST, ACTION_RESTART,   ACTION_ALERT);
!   addeventaction(&(current)->action_TIMEOUT,  ACTION_UNMONITOR, ACTION_ALERT);
    addeventaction(&(current)->action_PID,      ACTION_ALERT,     ACTION_IGNORE);
    addeventaction(&(current)->action_PPID,     ACTION_ALERT,     ACTION_IGNORE);
    
--- 1562,1568 ----
    addeventaction(&(current)->action_EXEC,     ACTION_ALERT,     ACTION_ALERT);
    addeventaction(&(current)->action_INVALID,  ACTION_RESTART,   ACTION_ALERT);
    addeventaction(&(current)->action_NONEXIST, ACTION_RESTART,   ACTION_ALERT);
!   addeventaction(&(current)->action_TIMEOUT,  ACTION_ALERT,     ACTION_ALERT);
    addeventaction(&(current)->action_PID,      ACTION_ALERT,     ACTION_IGNORE);
    addeventaction(&(current)->action_PPID,     ACTION_ALERT,     ACTION_IGNORE);
    
Index: validate.c
===================================================================
RCS file: /cvsroot/monit/monit/validate.c,v
retrieving revision 1.140
diff -b -c -r1.140 validate.c
*** validate.c  11 May 2005 21:28:02 -0000      1.140
--- validate.c  28 Jul 2005 20:52:09 -0000
***************
*** 500,505 ****
--- 500,511 ----
      p->is_available= TRUE;
      Event_post(s, EVENT_CONNECTION, FALSE, p->action,
        "'%s' connection passed", s->name);
+               if(s->ncycle == s->to_cycle){
+                       s->nstart--;
+               }
+               if(s->nstart < 0){
+                       s->nstart = 0;
+               }
    }
        
  }
***************
*** 1045,1073 ****
    if(!s->def_timeout)
      return FALSE;
  
    /*
     * Start counting cycles
     */
    if(s->nstart > 0)
      s->ncycle++;
  
    /*
     * Check timeout
     */
    if(s->nstart >= s->to_start && s->ncycle <= s->to_cycle) {
      Event_post(s, EVENT_TIMEOUT, TRUE, s->action_TIMEOUT,
!               "'%s' service timed out and will not be checked anymore",
                s->name);
-     return TRUE;
    }
  
    /*
     * Stop counting and reset if the
     * cycle interval is passed
     */
!   if(s->ncycle > s->to_cycle) {
      s->ncycle= 0;
!     s->nstart= 0;
    }
  
    return FALSE;
--- 1051,1093 ----
    if(!s->def_timeout)
      return FALSE;
  
+ /*log( "ns=%d, ts=%d, nc=%d, tc=%d\n", s->nstart, s->to_start, s->ncycle, s->to_cycle);*/
    /*
     * Start counting cycles
     */
    if(s->nstart > 0)
      s->ncycle++;
  
+       /* make sure counters don't exceed set limits */
+       if(s->ncycle > s->to_cycle){
+               s->ncycle = s->to_cycle;
+       }
+       if(s->nstart > s->to_start){
+               s->nstart = s->to_start;
+       }
    /*
     * Check timeout
     */
    if(s->nstart >= s->to_start && s->ncycle <= s->to_cycle) {
      Event_post(s, EVENT_TIMEOUT, TRUE, s->action_TIMEOUT,
!               "'%s' service timed out",
!               s->name);
! /*    return TRUE; */
!               return FALSE;
!   }else{
!     Event_post(s, EVENT_TIMEOUT, FALSE, s->action_TIMEOUT,
!               "'%s' service timeout recovered.",
                s->name);
        }       
  
    /*
     * Stop counting and reset if the
     * cycle interval is passed
     */
! /*  if(s->ncycle > s->to_cycle) { */
!       if(s->nstart == 0){
      s->ncycle= 0;
! /*    s->nstart= 2; */
    }
  
    return FALSE;

_______________________________________________
monit-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/monit-dev
