Re: [Simple-evcorr-users] Trying to report extended NFS problems along with an OK.

Douglas K. Rand Thu, 20 Feb 2014 10:12:12 -0800

On 02/20/14 07:40, Risto Vaarandi wrote:
> hi Douglas,

Hi Risto!


> I would personally use NFS-Alert:* context only for indicating that an
> alarm has been previously issued, and keep the NFS:* context for storing
> data throughout the event correlation scheme. Switching the roles in the
> middle of the scheme will complicate things and make them hard to
> understand later. I modified your ruleset a bit and was able to lessen
> the number of rules somewhat:

> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = add NFS:$1:$3:$4 $0
>
> type = PairWithWindow
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> context = !NFS-Alert:$1:$3:$4
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = create NFS-Alert:$1:$3:$4; \
>            report NFS:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" risto
> ptype2 = regexp
> continue2 = takenext
> pattern2 = \s$1 kernel: (newnfs|nfs) server $3:$4: is alive again
> desc2 = NFS working between client %1 and server %3 on filesystem %4
> action2 = delete NFS:%1:%3:%4
> window = 30
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
> context = NFS-Alert:$1:$3:$4
> desc = NFS now working between client $1 and server $3 on filesystem $4
> action = add NFS:$1:$3:$4 $0; \
>            report NFS:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" risto; \
>            delete NFS:$1:$3:$4; delete NFS-Alert:$1:$3:$4
>
>
> Note that the second rule removes the data context if alarm is not
> issued (we need this kind of garbage collection, because otherwise each
> very short NFS timeout will leave behind a context with infinite
> lifetime). Also, the third rule which issues AllClear-message will drop
> both the data and alarm-flag context.

Hey, thanks for the help. That definitely cleans things up quite a bit.

For the sake of the list, here is an updated ruleset. I tweaked the
regexps to handle both FreeBSD and Linux kernel messages about NFS. I
also added a threshold rule that counts the brief problems that don't
get reported and generates an alert if 5 show up within an hour. And
lastly, I made the reporting a bit more concise by shortening the event
descriptions.

Thanks to all for the advice and direction!


## Collect all NFS problems into a context for each
## client/server/filesystem group.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding
desc = NFS problem between $1 and $3:$4
action = add NFS:$1:$3:$4 $0
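
A quick way to sanity-check the pattern above outside of SEC is to run it
through Python's re module, which handles this particular expression the same
way Perl does. This is only a sketch: the two sample lines are made up to fit
the shape the pattern expects (one without and one with the optional colon
after the keyword), they are not captured kernel output, and context_name is
a hypothetical helper that mimics how the rule builds its NFS:$1:$3:$4
context name.

```python
import re

# Same expression as the "not responding" rule, unwrapped onto one line.
NOT_RESPONDING = re.compile(
    r"([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding"
)

def context_name(line):
    """Return the NFS:<client>:<server>:<filesystem> name, or None if no match."""
    m = NOT_RESPONDING.search(line)
    if m is None:
        return None
    client, _variant, server, filesystem = m.groups()
    return f"NFS:{client}:{server}:{filesystem}"

# Hypothetical test lines, invented for this sketch.
SAMPLES = [
    "Feb 20 10:00:00 client1 kernel: newnfs server filer1:/export/home: not responding",
    "Feb 20 10:00:05 client2 kernel: nfs: server filer2:/export/data not responding",
]

for line in SAMPLES:
    print(context_name(line))
```

Feeding real lines from your own logs through the same function is the more
useful test, since message formats vary between kernels.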

## If NFS doesn't start working again for a client/server/filesystem
## within the window, then generate an alert with all the messages
## so far as the body of the email. Create an NFS-Alert context to
## remember that we generated the alert.
type = PairWithWindow
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding
context = !NFS-Alert:$1:$3:$4
desc = NFS problem alarm between $1 and $3:$4
action = create NFS-Alert:$1:$3:$4; \
          report NFS:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
ptype2 = regexp
continue2 = takenext
pattern2 = \s($1) kernel: ($2):? server ($3):($4):? (is alive again|OK)
desc2 = NFS brief problem between $1 and $3:$4
action2 = event; delete NFS:$1:$3:$4
window = 30

## NFS is working again, so if we generated an alert earlier, send an
## all-clear alert now. Then clean up by deleting both contexts.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? (is alive again|OK)
context = NFS-Alert:$1:$3:$4
desc = NFS now working between $1 and $3:$4
action = add NFS:$1:$3:$4 $0; \
          report NFS:$1:$3:$4 /usr/bin/mail -s "NFS now working on $1" rand; \
          delete NFS:$1:$3:$4; \
          delete NFS-Alert:$1:$3:$4

## Keep a count of the unreported NFS problems; if we reach the
## threshold within the window, generate an alert.
type = SingleWithThreshold
ptype = regexp
pattern = NFS brief problem between ([^ ]+) and ([^:]+):([^ ]+)
desc = NFS 5 problems between $1 and $2:$3 in the last hour
action = pipe '%s' /usr/bin/mail -s "NFS unstable on $1" rand
window = 3600
thresh = 5
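
For anyone wondering how the window and thresh settings in the last rule
interact, here is a rough Python model of the counting. It is a simplified
sketch of my understanding of SingleWithThreshold, not SEC's actual code:
ThresholdCounter and observe are hypothetical names, timestamps are plain
seconds, and the exact behaviour at the window boundary is an assumption.

```python
class ThresholdCounter:
    """Simplified model of a sliding-window threshold counter."""

    def __init__(self, window=3600, thresh=5):
        self.window = window
        self.thresh = thresh
        self.times = []     # timestamps counted in the current window
        self.fired = False  # True once the alert went out for this window

    def observe(self, now):
        """Feed one matching event; return True when the alert would fire."""
        if self.times and now - self.times[0] >= self.window:
            if self.fired:
                # The window that produced the alert is over: start fresh.
                self.times = []
                self.fired = False
            else:
                # Threshold not reached: slide the window forward past
                # events that are now too old to count.
                while self.times and now - self.times[0] >= self.window:
                    self.times.pop(0)
        if self.fired:
            # Already alerted in this window; swallow the event silently.
            return False
        self.times.append(now)
        if len(self.times) >= self.thresh:
            self.fired = True
            return True
        return False


counter = ThresholdCounter(window=3600, thresh=5)
results = [counter.observe(t) for t in (0, 600, 1200, 1800, 2400, 3000, 3700)]
print(results)
```

In this model the fifth event inside the hour triggers the alert, later events
in the same window are ignored, and counting starts over once that window has
passed.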

