On your answer to my final question: OK, that works for the specific example I
gave. But what if the cause of the down changes to a "real" problem on the
third or fourth (etc.) check rather than on the second? In that case, surely an
alert will not be sent?

Brain bender: what's your guesstimate for when the boolean check will be 
available? 

Cheers, 

Ian 
_________________________________
Ian K Gray
OEL IS - European Infrastructure Support
Tel: +44 1236 502661
Mob: +44 7881 518854
Ad eundum quo nemo ante iit 


"Dirk" <[EMAIL PROTECTED]> 
Sent by: Servers Alive Discussion List <salive@woodstone.nu>
Date: 24/10/2008 12:50
Please respond to: Servers Alive Discussion List <salive@woodstone.nu>
To: Servers Alive Discussion List <salive@woodstone.nu>
cc:
Subject: RE: [SA-list] Complex check/alert logic




I'll start with the "final question" :-)
If the entry is down (for whatever reason), the internal_down_counter is
increased by 1. In your example, the first down is due to a timeout, so the
counter is 1.
On the next check it is down again (the counter becomes 2), this time due to
the real error (not a timeout).
=> As the alert is set to be sent on the 2nd down (and not on timeout), the
alert will be sent.
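
As a rough illustration of the counter behaviour described above, here is a minimal Python sketch. It is only a model of what this thread describes, not Servers Alive's actual code: the class name `Entry`, its fields, and the rule that the alert fires when the counter exactly equals the configured down number are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical model of one monitored entry (not Servers Alive code)."""
    alert_on_down_number: int = 2    # assumed: alert on exactly the Nth down
    suppress_word: str = "timeout"   # assumed: no alert if %e contains this
    down_counter: int = 0

    def record(self, is_down: bool, error_text: str = "") -> bool:
        """Record one check result; return True if a DOWN alert would go out."""
        if not is_down:
            self.down_counter = 0
            return False
        # The counter increases on every down, whatever the reason.
        self.down_counter += 1
        reached = self.down_counter == self.alert_on_down_number
        suppressed = self.suppress_word in error_text.lower()
        return reached and not suppressed


entry = Entry()
first = entry.record(True, "Timeout while connecting")   # counter 1: no alert
second = entry.record(True, "HTTP 500 returned")         # counter 2: real error
```

In this model `first` is `False` and `second` is `True`, which matches the example. Under the exact-match assumption, a real error that first appears on the third or a later down would not trigger the alert.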
 
For the "brain bender", I think the boolean check might work:

The boolean check would give a DOWN if check_1 = DOWN and check_2 = DOWN
-> alert on that one.
This check would be UP in all other cases, so the "back up" (all clear) alert
could even be sent if check_1 = DOWN and check_2 = UP. Is this OK?
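
The proposed boolean check can be sketched in a few lines of Python. This is an illustration of the idea only: the function names `boolean_check` and `alerts_for` are made up for the example, and treating a check that was not run as "not down" is an assumption.

```python
def boolean_check(check_1_down: bool, check_2_down: bool) -> str:
    """Combined status: DOWN only when both underlying checks are DOWN."""
    return "DOWN" if (check_1_down and check_2_down) else "UP"


def alerts_for(samples):
    """Return the messages sent as the combined status changes over time.

    Each sample is a (check_1_down, check_2_down) pair. A dependent check
    that was not run is passed as False (assumed: not run means not down).
    """
    previous = "UP"
    messages = []
    for check_1_down, check_2_down in samples:
        current = boolean_check(check_1_down, check_2_down)
        if current != previous:
            messages.append("ALERT" if current == "DOWN" else "ALL CLEAR")
        previous = current
    return messages


messages = alerts_for([
    (True, False),   # local proxy problem only: combined stays UP
    (True, True),    # both paths fail: combined goes DOWN
    (True, False),   # second path recovers: combined goes UP again
    (False, False),  # everything fine: no change
])
```

Because the alert and the all clear both hang off the combined status, an all clear can only follow an alert on that same combined check.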
 
 
 
 
 
 
Dirk Bulinckx. 
From: Servers Alive Discussion List [mailto:[EMAIL PROTECTED] On Behalf Of 
[EMAIL PROTECTED]
Sent: Friday, October 24, 2008 1:01 PM
To: Servers Alive Discussion List
Subject: [SA-list] Complex check/alert logic 
 

Hi Dirk et al, 

Ready for a brain bender? Any ideas on the following scenario: 

* I run checks on an external website (http get). To avoid false alarms due to 
problems with the local proxy server or internet connection, I run a second 
check (dependent on the first check being DOWN) via the WAN to a different 
proxy server. Alerts only go out if the second check fails. 
* Because the dreaded timeouts generate false alerts every now and then, I only 
alert after two downs, and even then only if the word "timeout" is not returned 
in %e. 
* However (and here's where it gets tricky), I need to send out an "all clear" 
alert when the checks show the website is up again. If I configure an all clear 
alert against the second check, it typically won't ever get sent (because 
assuming it *wasn't* a proxy or internet line problem, once the website works 
again the first check will show as UP, and the second check won't be run). 
* If I configure the all clear alert within the first check instead, it would 
send out an all clear for the website when the first check shows as ok, even if 
the second check never went down. (Are you following this so far?) 
* The problem is compounded when you consider the timeouts problem. Let's say 
the second check fails once because of a timeout (effectively a false alarm), 
or fails several times but doesn't send out the alert because the word 
"timeout" suppresses it. The all clear alert (whether it is configured in the 
first or the second check) will still be sent out, and recipients (who never 
had a DOWN alert because it was just a timeout) will get very confused and 
wonder what's going on.

So I guess the issues here are: 
1. Finding some way of only sending an all clear message if the original
   down was for more than x cycles, or suppressing it if the previous %e
   contained "timeout".
2. Sending an all clear alert when a check returns to "unavailable" from
   "down", rather than just to "up" from "down".
3. And of course, the ongoing problem of dealing neatly with timeouts
   rather than "real" downs.
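
One way to picture the behaviour these three points ask for is a small state machine that only sends an all clear if a DOWN alert actually went out during the same outage. The Python sketch below is purely hypothetical and is not a Servers Alive feature: the class `AlertGate`, its parameters, and the choice to alert on the first non-timeout error at or after the configured number of downs are all assumptions for illustration.

```python
from typing import Optional


class AlertGate:
    """Hypothetical alert logic: all clear only after a real DOWN alert."""

    def __init__(self, min_downs: int = 2, suppress_word: str = "timeout"):
        self.min_downs = min_downs
        self.suppress_word = suppress_word
        self.down_count = 0
        self.down_alert_sent = False

    def record(self, is_down: bool, error_text: str = "") -> Optional[str]:
        """Record one check result and return the message to send, if any."""
        if is_down:
            self.down_count += 1
            real_error = self.suppress_word not in error_text.lower()
            # Assumed rule: alert once, on the first real error at or after
            # the configured number of consecutive downs.
            if (self.down_count >= self.min_downs and real_error
                    and not self.down_alert_sent):
                self.down_alert_sent = True
                return "DOWN ALERT"
            return None
        # Any non-DOWN state (up or unavailable) counts as recovery here.
        send_clear = self.down_alert_sent
        self.down_count = 0
        self.down_alert_sent = False
        return "ALL CLEAR" if send_clear else None


gate = AlertGate()
# Two timeouts, then recovery: nobody was alerted, so no all clear.
timeout_only = [gate.record(True, "timeout"), gate.record(True, "timeout"),
                gate.record(False)]
# Timeouts followed by a real error on the third down, then recovery.
real_outage = [gate.record(True, "timeout"), gate.record(True, "timeout"),
               gate.record(True, "connection refused"), gate.record(False)]
```

With this gate, recipients who never received a DOWN alert never receive an all clear, and a real error that follows one or more timeouts still raises an alert.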

One final question to pose here. Let's say a check is configured to send an 
alert when something goes down, but not if the %e includes the word "timeout". 
First time the check fails, it is a timeout, and therefore the alert is 
suppressed. But let's say that the timeout is in fact a precursor to a real 
problem, and on the next check cycle the check fails with a different ("real") 
error. Now an alert should be sent out - but because the status hasn't changed 
(it was down last time, and it's still down), an alert won't be sent out (or at 
least I presume not). How do we deal with this? I've still got the idea in the 
back of my mind about SA having a different status (i.e. not just UP or DOWN) 
for timeouts... 

Cheers, 

Ian 
_________________________________
Ian K Gray
OEL IS - European Infrastructure Support
Tel: +44 1236 502661
Mob: +44 7881 518854
Ad eundum quo nemo ante iit 
______________________________________________________________________________


Any opinions expressed in this email are those of the individual and not 
necessarily of the Company. This email and any files transmitted with it, 
including replies and forwarded copies (which may contain alterations) 
subsequently transmitted from the Company are confidential and solely for the 
use of the intended recipient. It may contain material protected by legal 
privilege. If you are not the intended recipient or the person responsible for 
delivering to the intended recipient, be advised that you have received this 
email in error and that any use is strictly prohibited.
Please notify the sender immediately of the error and delete any copies of this 
message

Warning: Although the Company has taken reasonable precautions to ensure that 
no viruses are present in this e-mail, the Company cannot accept responsibility 
for any loss or damage arising from the use of this e-mail or attachments.


To unsubscribe send a message with UNSUBSCRIBE in the subject line to 
salive@woodstone.nu
If you use auto-responders (like out-of-the-office messages), make sure that 
they are not sent to the list nor to individual members. Doing so will cause 
you to be automatically removed from the list. 


