Hi all,

I have been getting an incredible number of false-positive pages recently. I have to believe it has something to do with my application, but most of the failures that page me correct themselves one cycle later. I put in a test to hit Google on port 80, and even that paged me once in the middle of the night.

This pissed me off enough to do something about it. Reading through the list archives, I found this post:

http://lists.gnu.org/archive/html/monit-general/2005-04/msg00016.html

It gave me a nice idea (and I followed his example), but I really didn't like the fact that after a single failure, the service requires human intervention to restart monitoring (since the timeout function disables monitoring for that service).

So I started making code changes. Unfortunately, I didn't do it the *right* way, because it's been way too long since I played with flex etc. Instead, I took advantage of "if x restarts in y cycles then timeout" but eviscerated the ACTION_TIMEOUT functionality: it no longer actually times out, it just alerts. What I really wanted was "if x restarts in y cycles then alert", but I couldn't figure out the right way to do it. The timeout functionality was designed to start counting at the first failure and, if a service actually times out, to stop monitoring it, so the counter manipulation didn't work so well when timeouts could be triggered and recovered often.

The end result: I have a rule like:

  set alert [EMAIL PROTECTED] {timeout}
  check host RadixTest with address cryptio.net
    start program = "/bin/true"
    stop program = "/bin/true"
    if 2 restarts within 3 cycles then timeout
    if failed url http://cryptio.net/~ben/lilo.conf and content == "default=linuxprep" then restart

and I only get paged if it fails twice within three attempts, i.e.:

  fail pass fail == page
  fail pass pass == no page
  fail fail pass == page

I also made it decrement the counter slowly on passes, so that

  fail pass fail pass fail == failure-page, recovery-page, failure-page

i.e. if it's recently failed, be more paranoid.

One annoyance is that the check_timeout function comes before the service test instead of afterwards, so I actually get paged at the beginning of the cycle following the failure condition. I'm checking every 60 seconds, so I can deal with that.
A correct solution wouldn't exhibit this problem... ;)

Another less-than-desirable trait: IMHO, the right way to do this kind of thing is to use the leaky-bucket algorithm (as many network protocols do), in which failures add up quickly but subside slowly. You would have to specify a rate at which the failure counter drops in addition to the thresholds, e.g. 5 failures within 10 attempts, decreasing the failure counter at a rate of 1 per 5 successes. This allows a certain amount of flakiness, but alerts you quickly on a hard failure, and alerts you if it gets too flaky.

Anyway... In case any of you are interested, I have attached a patch of the modifications I made (to the head of the CVS tree).

-ben

p.s. Is this the right list, or should I have posted this to monit-general? It seems much more high volume; all I see go by here are announcements of newly checked-in files...

--
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net
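
For illustration, the leaky-bucket idea from the message could look roughly like the C sketch below. It is not part of the attached patch and not monit code: the LeakyBucket type, bucket_init, bucket_record and all field names are hypothetical. It assumes one reading of the proposal: each failure adds one unit (capped at the threshold), every drain_every passes remove one unit, and an alert is due while the level sits at the threshold. The "within 10 attempts" window is not modelled separately, since the drain rate already bounds how long old failures count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical leaky-bucket failure counter; not taken from monit. */
typedef struct {
    int level;        /* failures currently held in the bucket */
    int threshold;    /* alert while level >= threshold */
    int drain_every;  /* passes needed to drain one failure */
    int passes;       /* passes counted since the last drain */
} LeakyBucket;

void bucket_init(LeakyBucket *b, int threshold, int drain_every) {
    assert(threshold > 0 && drain_every > 0);
    b->level = 0;
    b->threshold = threshold;
    b->drain_every = drain_every;
    b->passes = 0;
}

/* Record one check result; returns true while an alert is due. */
bool bucket_record(LeakyBucket *b, bool failed) {
    if (failed) {
        /* Failures add up quickly: one unit each, capped at the threshold. */
        if (b->level < b->threshold)
            b->level++;
    } else if (b->level > 0) {
        /* Passes subside slowly: drain one unit per drain_every passes. */
        if (++b->passes >= b->drain_every) {
            b->passes = 0;
            b->level--;
        }
    }
    return b->level >= b->threshold;
}
```

With bucket_init(&b, 5, 5), a hard failure alerts on the fifth consecutive failed check, and the alert clears only after five passing checks have drained one unit. A flaky service that fails more often than once per five passes slowly fills the bucket and eventually alerts as well.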
Index: p.y
===================================================================
RCS file: /cvsroot/monit/monit/p.y,v
retrieving revision 1.208
diff -c -b -r1.208 p.y
*** p.y 3 Apr 2005 11:56:51 -0000 1.208
--- p.y 28 Jul 2005 20:51:00 -0000
***************
*** 1562,1568 ****
addeventaction(&(current)->action_EXEC, ACTION_ALERT, ACTION_ALERT);
addeventaction(&(current)->action_INVALID, ACTION_RESTART, ACTION_ALERT);
addeventaction(&(current)->action_NONEXIST, ACTION_RESTART, ACTION_ALERT);
! addeventaction(&(current)->action_TIMEOUT, ACTION_UNMONITOR, ACTION_ALERT);
addeventaction(&(current)->action_PID, ACTION_ALERT,
ACTION_IGNORE);
addeventaction(&(current)->action_PPID, ACTION_ALERT,
ACTION_IGNORE);
--- 1562,1568 ----
addeventaction(&(current)->action_EXEC, ACTION_ALERT, ACTION_ALERT);
addeventaction(&(current)->action_INVALID, ACTION_RESTART, ACTION_ALERT);
addeventaction(&(current)->action_NONEXIST, ACTION_RESTART, ACTION_ALERT);
! addeventaction(&(current)->action_TIMEOUT, ACTION_ALERT, ACTION_ALERT);
addeventaction(&(current)->action_PID, ACTION_ALERT,
ACTION_IGNORE);
addeventaction(&(current)->action_PPID, ACTION_ALERT,
ACTION_IGNORE);
Index: validate.c
===================================================================
RCS file: /cvsroot/monit/monit/validate.c,v
retrieving revision 1.140
diff -b -c -r1.140 validate.c
*** validate.c 11 May 2005 21:28:02 -0000 1.140
--- validate.c 28 Jul 2005 20:52:09 -0000
***************
*** 500,505 ****
--- 500,511 ----
p->is_available= TRUE;
Event_post(s, EVENT_CONNECTION, FALSE, p->action,
"'%s' connection passed", s->name);
+ if(s->ncycle == s->to_cycle){
+ s->nstart--;
+ }
+ if(s->nstart < 0){
+ s->nstart = 0;
+ }
}
}
***************
*** 1045,1073 ****
if(!s->def_timeout)
return FALSE;
/*
* Start counting cycles
*/
if(s->nstart > 0)
s->ncycle++;
/*
* Check timeout
*/
if(s->nstart >= s->to_start && s->ncycle <= s->to_cycle) {
Event_post(s, EVENT_TIMEOUT, TRUE, s->action_TIMEOUT,
! "'%s' service timed out and will not be checked anymore",
s->name);
- return TRUE;
}
/*
* Stop counting and reset if the
* cycle interval is passed
*/
! if(s->ncycle > s->to_cycle) {
s->ncycle= 0;
! s->nstart= 0;
}
return FALSE;
--- 1051,1093 ----
if(!s->def_timeout)
return FALSE;
+ /*log( "ns=%d, ts=%d, nc=%d, tc=%d\n", s->nstart, s->to_start, s->ncycle,
s->to_cycle);*/
/*
* Start counting cycles
*/
if(s->nstart > 0)
s->ncycle++;
+ /* make sure counters don't exceed set limits */
+ if(s->ncycle > s->to_cycle){
+ s->ncycle = s->to_cycle;
+ }
+ if(s->nstart > s->to_start){
+ s->nstart = s->to_start;
+ }
/*
* Check timeout
*/
if(s->nstart >= s->to_start && s->ncycle <= s->to_cycle) {
Event_post(s, EVENT_TIMEOUT, TRUE, s->action_TIMEOUT,
! "'%s' service timed out",
! s->name);
! /* return TRUE; */
! return FALSE;
! }else{
! Event_post(s, EVENT_TIMEOUT, FALSE, s->action_TIMEOUT,
! "'%s' service timeout recovered.",
s->name);
}
/*
* Stop counting and reset if the
* cycle interval is passed
*/
! /* if(s->ncycle > s->to_cycle) { */
! if(s->nstart == 0){
s->ncycle= 0;
! /* s->nstart= 2; */
}
return FALSE;
_______________________________________________
monit-dev mailing list
[email protected]
http://lists.nongnu.org/mailman/listinfo/monit-dev
