On 02/18/14 17:27, John P. Rouillard wrote:
>
> In message <[email protected]>,
> "Douglas K. Rand" writes:
>> With the usual BSD syslog messages related to NFS problems:
>>
>> hostA kernel: nfs server hostb:/filesystem: not responding
>> ...
>> hostA kernel: nfs server hostb:/filesystem: is alive again
>>
>> What I'm trying to do is generate an alert email if we see a "not
>> responding" message without a corresponding "is alive again" within 60
>> seconds. But if we sent out the alert email, I also want to send an
>> all-clear email when we do eventually get the "alive again" message,
>> perhaps even hours later.
>>
>> I've figured out how to do one or the other: I can generate the alert
>> email if a "not responding" message is not followed by an "alive again"
>> message within 60 seconds.
>>
>> And I can generate the all-clear message for each and every "alive
>> again" message.
>>
>> But putting them together is stumping me. I only want to send the
>> all-clear message if we already have sent out the alert email; but if we
>> don't send out the alert there is no reason to send out the all-clear.
>>
>> I was thinking that when the "alive again" message came in I could do
>> something like:
>>
>> if ($age > 60) report context mail -s "NFS all-clear" rand;
>> delete context
>>
>> I'm not even sure that is the right approach; it seems not to fit into
>> SEC, even if I could figure out how to do it.
>
>
> Try this:
>
> desc = send clear for $1:$2 if we sent an alert
> type = single
> ptype = regexp
> pattern = nfs server ([^:]*):(/[^:]*): is alive again
> context = waiting_for_$1:$2_to_become_alive_again
> action = actions needed to send all clear; \
> delete waiting_for_$1:$2_to_become_alive_again
>
> desc = detect not responding nfs server:filesystem $1:$2
> type = pairwithwindow
> ptype = regexp
> pattern = nfs server ([^:]*):(/[^:]*): not responding
> action = do the action that reports that there is a failure; \
>          create waiting_for_$1:$2_to_become_alive_again
> ptype2 = substr
> pattern2 = nfs server $1:$2: is alive again
> desc2 = found the recovery for $1:$2 within 60 seconds
> action2 = none
> window = 60
>
> Basically the idea is when you send an alert create a context that
> turns on a single rule that will send the clear for that
> host:filesystem when it sees the corresponding "is alive again" event.
>
> The downside of this pair of events is that it may be possible to get
> the following:
>
>   delay (sec)   event
>
>        0        "not responding"
>       60        send alert/create context
>       61        "not responding"
>       62        "is alive"
>
> the "is alive" event at delay 62 will be captured by the single rule
> and the pairwithwindow rule will generate a new alert. If you add
> continue=takenext to the single rule, that will get around this issue,
> but may cause other rules in your ruleset to trigger.
>
> This should give you an idea of how to do it and you can adapt it to
> the particular rules you are using to match the original not
> responding/is alive pair of events.

Risto and John, thanks for your feedback; you got me to where I was
generating the alerts.

Today I was trying to throw in a wrinkle: to include in the report all
of the related log messages so that the alert, and especially the
all-clear alert, would provide more context to the recipient. This gets
me close, but the NFS-Alert:$client:$server:$filesystem context gets
thrown away at some point:

# Collect every "not responding" line in a per-mount buffer context.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
desc = NFS problem between client $1 and server $3 on filesystem $4
action = add NFS:$1:$3:$4 $0

# While an alert is open, also append the line to the alert context.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
context = NFS-Alert:$1:$3:$4
desc = NFS problem between client $1 and server $3 on filesystem $4
action = add NFS-Alert:$1:$3:$4 $0

# Alert if no "is alive again" follows within 30 seconds.
type = PairWithWindow
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
context = !NFS-Alert:$1:$3:$4
desc = NFS problem between client $1 and server $3 on filesystem $4
action = create NFS-Alert:$1:$3:$4; \
         exists %nfs NFS:$1:$3:$4; \
         if %nfs (copy NFS:$1:$3:$4 %nfs; add NFS-Alert:$1:$3:$4 %nfs); \
         report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
ptype2 = regexp
continue2 = takenext
pattern2 = $1 kernel: (newnfs|nfs) server $3:$4: is alive again
desc2 = NFS working between client %1 and server %3 on filesystem %4
action2 = none
window = 30

# Send the all-clear with the collected lines, then drop the alert context.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
desc = NFS now working between client $1 and server $3 on filesystem $4
action = add NFS-Alert:$1:$3:$4 $0; \
         report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" rand; \
         delete NFS-Alert:$1:$3:$4

# Drop the buffer context once the mount has recovered.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
desc = NFS destroy collector client $1 and server $3 on filesystem $4
action = delete NFS:$1:$3:$4

Here is what I've been trying as test input, with the timestamps used as
rough timings:

Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:00:30 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:01:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:01:02 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
Feb 17 00:01:05 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
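
For anyone who wants to replay this input with the same rough timings, a small script along these lines might help. It is only a sketch: the names replay_delays and replay are made up for illustration, and the year is an assumption because syslog timestamps do not carry one.

```python
import time
from datetime import datetime

YEAR = 2014  # assumption: syslog timestamps carry no year

TEST_LINES = [
    "Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding",
    "Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding",
    "Feb 17 00:00:30 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding",
    "Feb 17 00:01:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding",
    "Feb 17 00:01:02 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again",
    "Feb 17 00:01:05 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again",
]


def replay_delays(lines, year=YEAR):
    """Seconds to wait before writing each line, taken from its timestamp."""
    stamps = [
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in lines
    ]
    gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    return [0] + gaps


def replay(lines, path, sleep=time.sleep):
    """Append the lines to `path`, pausing between them as the timestamps say."""
    for delay, line in zip(replay_delays(lines), lines):
        sleep(delay)
        with open(path, "a") as log:
            log.write(line + "\n")
```

Calling replay(TEST_LINES, "nfs-test.log") takes a little over a minute; the file name is hypothetical and would be whatever file SEC is reading through its --input option.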

What I'm hoping for is 2 emails.

The first at 00:00:30 with a subject of "NFS extended problems on deneb"
with a body of:

Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding

The second email at 00:01:02 with a subject of "NFS working on deneb"
with a body of:

Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:00:30 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:01:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
Feb 17 00:01:02 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
Feb 17 00:01:05 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again

I admit that having all of the log lines in the reports is just icing;
with your help I've already got the cake baked.

From your original advice, this is what I came up with:

# While an alert is open, append further "not responding" lines to it.
type = Single
ptype = regexp
continue = dontcont
context = NFS-Alert:$1:$3:$4
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
desc = NFS continued problems between client $1 and server $3 on filesystem $4
action = add NFS-Alert:$1:$3:$4 $0

# Alert if no "is alive again" follows within 30 seconds.
type = PairWithWindow
ptype = regexp
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
desc = NFS problem between client $1 and server $3 on filesystem $4
action = create NFS-Alert:$1:$3:$4; \
         add NFS-Alert:$1:$3:$4 $0; \
         report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
ptype2 = regexp
pattern2 = $1 kernel: (newnfs|nfs) server $3:$4: is alive again
desc2 = NFS working between client %1 and server %3 on filesystem %4
action2 = none
window = 30

# Send the all-clear only if an alert is open, then close it.
type = Single
ptype = regexp
context = NFS-Alert:$1:$3:$4
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
desc = NFS now working between client $1 and server $3 on filesystem $4
action = add NFS-Alert:$1:$3:$4 $0; \
         report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" rand; \
         delete NFS-Alert:$1:$3:$4

But it suffers from a problem. In the above set of syslog messages, the
one at 00:00:15 gets dropped. What seems to happen is that every message
after the very first one that matches the PairWithWindow rule, up until
the window closes, doesn't get added to the context. And that seems to
be exactly what PairWithWindow is supposed to do.
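
To make that concrete, here is a rough Python model of the three rules for a single client/server/filesystem. It is not SEC, just a sketch of the event flow as I understand it; the names simulate, EVENTS and WINDOW are made up, and it assumes an expired window is handled before a line that arrives in the same second.

```python
WINDOW = 30  # seconds, matching "window = 30" in the rules

PREFIX = "<kern.info> deneb kernel: nfs server pid950@deneb:/home: "

# (seconds since the first message, log line) for the test input
EVENTS = [
    (0, "Feb 17 00:00:00 " + PREFIX + "not responding"),
    (15, "Feb 17 00:00:15 " + PREFIX + "not responding"),
    (30, "Feb 17 00:00:30 " + PREFIX + "not responding"),
    (60, "Feb 17 00:01:00 " + PREFIX + "not responding"),
    (62, "Feb 17 00:01:02 " + PREFIX + "is alive again"),
    (65, "Feb 17 00:01:05 " + PREFIX + "is alive again"),
]


def simulate(events, window=WINDOW):
    """Toy model of the rule chain; returns (subject, body lines) per mail."""
    alert_ctx = None  # event store of the NFS-Alert context, None if absent
    pending = None    # (start time, first line) of the PairWithWindow operation
    mails = []

    for now, line in events:
        # Assumption: window expiry is processed before the current line.
        if pending is not None and now >= pending[0] + window:
            alert_ctx = [pending[1]]  # create, then add $0 of the first event
            mails.append(("NFS extended problems", list(alert_ctx)))
            pending = None

        if line.endswith("not responding"):
            if alert_ctx is not None:
                alert_ctx.append(line)  # first rule: alert open, line is kept
            elif pending is None:
                pending = (now, line)   # PairWithWindow starts its window
            # Otherwise PairWithWindow matches again while its operation is
            # running and consumes the line without storing it anywhere.
        elif line.endswith("is alive again"):
            if pending is not None:
                pending = None          # pattern2 matched, action2 = none
            elif alert_ctx is not None:
                alert_ctx.append(line)  # last rule: add, report, delete
                mails.append(("NFS working", list(alert_ctx)))
                alert_ctx = None
    return mails


for subject, body in simulate(EVENTS):
    print(subject)
    for entry in body:
        print("   ", entry)
```

In this model the 00:00:15 line reaches only the PairWithWindow rule, because the NFS-Alert context does not exist yet, and nothing there stores it.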

Thanks again.