Re: [Simple-evcorr-users] Trying to report extended NFS problems along with an OK.

John P. Rouillard Tue, 18 Feb 2014 15:30:07 -0800

In message <[email protected]>,
"Douglas K. Rand" writes:
>With the usual BSD syslog messages related to NFS problems:
>
>hostA kernel: nfs server hostb:/filesystem: not responding
>...
>hostA kernel: nfs server hostb:/filesystem: is alive again
>
>What I'm trying to do is generate an alert email if we see a "not 
>responding" message without a corresponding "is alive again" within 60 
>seconds. But if we sent out the alert email I also want to send an 
>all-clear email when we do eventually get the "alive again" message, 
>perhaps even hours later.
>
>I've figured out how to do one or the other: I can generate the alert 
>email if a "not responding" message is not followed by an "alive again" 
>message within 60 seconds.
>
>And I can generate the all-clear message for each and every "alive 
>again" message.
>
>But putting them together is stumping me. I only want to send the 
>all-clear message if we already have sent out the alert email; but if we 
>don't send out the alert there is no reason to send out the all-clear.
>
>I was thinking that when the "alive again" message comes in I could do 
>something like:
>
>if ($age > 60) report context mail -s "NFS all-clear" rand;
>delete context
>
>I'm not even sure that is the right approach, seems to not fit into SEC, 
>even if I could figure out how to do it.



Try this:

desc = send clear for $1:$2 if we sent an alert
type = single
ptype = regexp
pattern = nfs server ([^:]*):(/[^:]*): is alive again
context = waiting_for_$1:$2_to_become_alive_again
action = actions needed to send all clear; \
         delete waiting_for_$1:$2_to_become_alive_again

desc = detect not responding nfs server:filesystem $1:$2
type = pairwithwindow
ptype = regexp
pattern = nfs server ([^:]*):(/[^:]*): not responding
action = do the action that reports that there is a failure; \
         create waiting_for_$1:$2_to_become_alive_again
ptype2 = substr
pattern2 = nfs server $1:$2: is alive again
desc2 = found the recover for $1:$2 within 60 seconds
action2 = none
window = 60

Basically the idea is when you send an alert create a context that
turns on a single rule that will send the clear for that
host:filesystem when it sees the corresponding "is alive again" event.
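
As one possible way to fill in the two placeholder actions in the rules
above, SEC's pipe action can feed a message to a mail command. This is
only a sketch: the mail command being on the path, the recipient "rand"
(borrowed from the pseudo-code in the question) and the subject lines
are all assumptions to adapt to your site.

```
# in the single rule: send the all-clear, then turn the rule off again
action = pipe 'NFS $1:$2 is alive again' mail -s "NFS all-clear" rand; \
         delete waiting_for_$1:$2_to_become_alive_again

# in the pairwithwindow rule: send the alert, then arm the single rule
action = pipe 'NFS $1:$2 not responding for 60 seconds' mail -s "NFS alert" rand; \
         create waiting_for_$1:$2_to_become_alive_again
```

Because the context only exists between the alert and the clear, an
"is alive again" that arrives inside the 60 second window never matches
the single rule and no all-clear is sent.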

The downside of this pair of rules is that the following sequence of
events is possible:

delay    event
(sec)

0       "not responding"
60      send alert/create context
61      "not responding"
62      "is alive"

The "is alive" event at delay 62 will be captured by the single rule,
so the pairwithwindow operation started at delay 61 never sees it and
will generate a new alert when its window expires. If you add
continue=takenext to the single rule, that will get around this issue,
but may cause other rules in your ruleset to trigger.
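
A sketch of the single rule with that change, assuming it stays ahead
of the pairwithwindow rule in the same rule file (the action text is
still the placeholder from above):

```
type = single
continue = takenext
ptype = regexp
pattern = nfs server ([^:]*):(/[^:]*): is alive again
context = waiting_for_$1:$2_to_become_alive_again
desc = send clear for $1:$2 if we sent an alert
action = actions needed to send all clear; \
         delete waiting_for_$1:$2_to_become_alive_again
```

With continue=takenext the "is alive again" event is passed on after the
clear is sent, so pattern2 of the pairwithwindow rule can still match it
and end the operation started by the second "not responding" quietly.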

This should give you an idea of how to do it and you can adapt it to
the particular rules you are using to match the original not
responding/is alive pair of events.

--
                                -- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

_______________________________________________
Simple-evcorr-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users
