Hi Douglas,
I would personally use the NFS-Alert:* context only to indicate that an
alarm has already been issued, and keep the NFS:* context for storing
data throughout the event correlation scheme. Switching the roles in the
middle of the scheme complicates things and makes the ruleset hard to
understand later. I modified your ruleset a bit and was able to reduce
the number of rules somewhat:

type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
desc = NFS problem between client $1 and server $3 on filesystem $4
action = add NFS:$1:$3:$4 $0

type = PairWithWindow
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
context = !NFS-Alert:$1:$3:$4
desc = NFS problem between client $1 and server $3 on filesystem $4
action = create NFS-Alert:$1:$3:$4; \
          report NFS:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" risto
ptype2 = regexp
continue2 = takenext
pattern2 = \s$1 kernel: (newnfs|nfs) server $3:$4: is alive again
desc2 = NFS working between client %1 and server %3 on filesystem %4
action2 = delete NFS:%1:%3:%4
window = 30

type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
context = NFS-Alert:$1:$3:$4
desc = NFS now working between client $1 and server $3 on filesystem $4
action = add NFS:$1:$3:$4 $0; \
          report NFS:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" risto; \
          delete NFS:$1:$3:$4; delete NFS-Alert:$1:$3:$4


Note that the second rule removes the data context if no alarm is
issued (we need this kind of garbage collection, because otherwise each
very short NFS timeout would leave behind a context with an infinite
lifetime). Also, the third rule, which issues the all-clear message,
drops both the data context and the alarm-flag context.
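
To make the flow of the three rules easier to follow, here is a rough
Python model of it. This is only a sketch and not SEC itself: the class
NfsCorrelator, its fields, and the event timings are made up for
illustration, it tracks a single client/server/filesystem triple, and it
assumes the window timer fires once more than 30 seconds have passed,
before the next event is handled.

```python
WINDOW = 30  # seconds, mirrors "window = 30" in the second rule


class NfsCorrelator:
    """Toy model of the three rules for one client/server/filesystem triple."""

    def __init__(self):
        self.data = None      # stands in for the NFS:* data context
        self.alerted = False  # stands in for the NFS-Alert:* flag context
        self.deadline = None  # end of a pending PairWithWindow operation
        self.mails = []       # (time, subject, body) instead of /usr/bin/mail

    def _add(self, line):
        # Like SEC's "add": create the context if it does not exist yet.
        if self.data is None:
            self.data = []
        self.data.append(line)

    def _expire(self, now):
        # Window ran out without "is alive again": set the flag and report.
        if self.deadline is not None and now > self.deadline:
            self.alerted = True
            self.mails.append(
                (self.deadline, "NFS extended problems", list(self.data)))
            self.deadline = None

    def feed(self, now, line):
        self._expire(now)
        if line.endswith("not responding"):
            self._add(line)                       # rule 1: always collect
            if not self.alerted and self.deadline is None:
                self.deadline = now + WINDOW      # rule 2: start the window
        elif line.endswith("is alive again"):
            if self.deadline is not None:         # rule 2, action2
                self.data = None                  # garbage-collect the data
                self.deadline = None
            elif self.alerted:                    # rule 3: all-clear
                self._add(line)
                self.mails.append((now, "NFS working", list(self.data)))
                self.data = None
                self.alerted = False


# Hypothetical timings (seconds), chosen to avoid an event exactly at the
# window boundary.
events = [
    (0, "hostA kernel: nfs server hostb:/filesystem: not responding"),
    (15, "hostA kernel: nfs server hostb:/filesystem: not responding"),
    (35, "hostA kernel: nfs server hostb:/filesystem: not responding"),
    (60, "hostA kernel: nfs server hostb:/filesystem: not responding"),
    (62, "hostA kernel: nfs server hostb:/filesystem: is alive again"),
    (65, "hostA kernel: nfs server hostb:/filesystem: is alive again"),
]

c = NfsCorrelator()
for t, line in events:
    c.feed(t, line)

for when, subject, body in c.mails:
    print(when, subject, len(body))
# 30 NFS extended problems 2
# 62 NFS working 5
```

In this model the second "is alive again" at 65 seconds does nothing,
because the alarm flag is already gone, and a short outage ("not
responding" followed by "is alive again" within the window) sends no
mail and leaves no data behind.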

Hope this helps,
risto


On 02/19/2014 07:31 PM, Douglas K. Rand wrote:
> On 02/18/14 17:27, John P. Rouillard wrote:
>>
>> In message <[email protected]>,
>> "Douglas K. Rand" writes:
>>> With the usual BSD syslog messages related to NFS problems:
>>>
>>> hostA kernel: nfs server hostb:/filesystem: not responding
>>> ...
>>> hostA kernel: nfs server hostb:/filesystem: is alive again
>>>
>>> What I'm trying to do is generate an alert email if we see a "not
>>> responding" message without a corresponding "is alive again" within 60
>>> seconds. But if we sent out the alert email I also want to send an
>>> all-clear email when we do eventually get the "alive again" message,
>>> perhaps even hours later.
>>>
>>> I've figured out how to do one or the other: I can generate the alert
>>> email if a "not responding" message is not followed by a "alive again"
>>> message within 60 seconds.
>>>
>>> And I can generate the all-clear message for each and every "alive
>>> again" message.
>>>
>>> But putting them together is stumping me. I only want to send the
>>> all-clear message if we already have sent out the alert email; but if we
>>> don't send out the alert there is no reason to send out the all-clear.
>>>
>>> I was thinking that when the "alive again" message came in I could do
>>> something like:
>>>
>>> if ($age > 60) report context mail -s "NFS all-clear" rand;
>>> delete context
>>>
>>> I'm not even sure that is the right approach, seems to not fit into SEC,
>>> even if I could figure out how to do it.
>>
>>
>> Try this:
>>
>> desc = send clear for $1:$2 if we sent an alert
>> type = single
>> ptype = regexp
>> pattern = nfs server ([^:]*):(/[^:]*): is alive again
>> context = waiting_for_$1:$2_to_become_alive_again
>> action = actions needed to send all clear; \
>>            delete waiting_for_$1:$2_to_become_alive_again
>>
>> desc = detect not responding nfs server:filesystem $1:$2
>> type = pairwithwindow
>> ptype = regexp
>> pattern = nfs server ([^:]*):(/[^:]*): not responding
>> action = do the action that reports that there is a failure; \
>>            create waiting_for_$1:$2_to_become_alive_again
>> ptype2 = substr
>> pattern2 = nfs server $1:$2: is alive again
>> desc2 = found the recover for $1:$2 within 60 seconds
>> action2 = none
>> window = 60
>>
>> Basically the idea is when you send an alert create a context that
>> turns on a single rule that will send the clear for that
>> host:filesystem when it sees the corresponding "is alive again" event.
>>
>> The downside of this pair of events is that it may be possible to get
>> the following:
>>
>> delay    event
>> (sec)
>>
>> 0       "not responding"
>> 60      send alert/create context
>> 61      "not responding"
>> 62      "is alive"
>>
>> the "is alive" event at delay 62 will be captured by the single rule
>> and the pairwithwindow rule will generate a new alert. If you add
>> continue=takenext to the single rule, that will get around this issue,
>> but may cause other rules in your ruleset to trigger.
>>
>> This should give you an idea of how to do it and you can adapt it to
>> the particular rules you are using to match the original not
>> responding/is alive pair of events.
>
> Risto and John, thanks for your feedback, you got me to where I was
> generating the alerts.
>
> Today I was trying to throw in a wrinkle: To include in the report all
> of the related log messages so that the alert, and especially the
> all-clear alert, would provide more context to the recipient. This gets
> me close, but the NFS-Alert:$client:$server:$filesystem context gets
> thrown away at some point.
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = add NFS:$1:$3:$4 $0
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> context = NFS-Alert:$1:$3:$4
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = add NFS-Alert:$1:$3:$4 $0
>
> type = PairWithWindow
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> context = !NFS-Alert:$1:$3:$4
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = create NFS-Alert:$1:$3:$4; \
>            exists %nfs NFS:$1:$3:$4; \
>            if %nfs (copy NFS:$1:$3:$4 %nfs; add NFS-Alert:$1:$3:$4 %nfs); \
>            report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
> ptype2 = regexp
> continue2 = takenext
> pattern2 = "$1 kernel: (newnfs|nfs) server $2:$3: is alive again
> desc2 = NFS working between client $1 and server $3 on filesystem $4
> action2 = none
> window = 30
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
> desc = NFS now working between client $1 and server $2 on filesystem $3
> action = add NFS-Alert:$1:$3:$4 $0; \
>          report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" rand; \
>          delete NFS-Alert:$1:$3:$4
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
> desc = NFS destroy collector client $1 and server $2 on filesystem $3
> action = delete NFS:$1:$3:$4
>
>
> Here is what I've been trying as test input with the timestamps used as
> rough timings:
>
> Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:00:30 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:01:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:01:02 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
> Feb 17 00:01:05 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
>
> What I'm hoping for is 2 emails:
>
> The first at 00:00:30 with a subject of "NFS extended problems on deneb"
> with a body of:
> Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
>
> The second email at 00:01:02 with a subject of "NFS working on deneb"
> with a body of:
> Feb 17 00:00:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:00:15 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:00:30 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:01:00 <kern.info> deneb kernel: nfs server pid950@deneb:/home: not responding
> Feb 17 00:01:02 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
> Feb 17 00:01:05 <kern.info> deneb kernel: nfs server pid950@deneb:/home: is alive again
>
> I admit that having all of the log lines in the reports is just icing,
> with your help I've already got the cake baked.
>
> From your original advice, this is what I came up with:
>
> type = Single
> ptype = regexp
> continue = dontcont
> context = NFS-Alert:$1:$3:$4
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> desc = NFS continued problems between client $1 and server $2 on filesystem $3
> action = add NFS-Alert:$1:$3:$4 $0
>
> type = PairWithWindow
> ptype = regexp
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = create NFS-Alert:$1:$3:$4; \
>            add NFS-Alert:$1:$3:$4 $0; \
>            report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
> ptype2 = regexp
> pattern2 = "$1 kernel: (newnfs|nfs) server $2:$3: is alive again
> desc2 = NFS working between client $1 and server $3 on filesystem $4
> action2 = none
> window = 30
>
> type = Single
> ptype = regexp
> context = NFS-Alert:$1:$3:$4
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
> desc = NFS now working between client $1 and server $2 on filesystem $3
> action = add NFS-Alert:$1:$3:$4 $0; \
>            report NFS-Alert:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" rand; \
>            delete NFS-Alert:$1:$3:$4
>
> But it suffers from a problem. In the above set of syslog messages, the
> one at 00:00:15 gets dropped. What seems to happen is that all messages
> between the very first one that matches the PairWithWindow event and all
> of the intervening ones until the window closes don't get added to
> the context. And that seems to be exactly what PairWithWindow is
> supposed to do.
>
> Thanks again.
>
>
>
> ------------------------------------------------------------------------------
> Managing the Performance of Cloud-Based Applications
> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
> Read the Whitepaper.
> http://pubads.g.doubleclick.net/gampad/clk?id=121054471&iu=/4140/ostg.clktrk
> _______________________________________________
> Simple-evcorr-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users
>
>

