On 02/20/14 07:40, Risto Vaarandi wrote:
> hi Douglas,
Hi Risto!
> I would personally use NFS-Alert:* context only for indicating that an
> alarm has been previously issued, and keep the NFS:* context for storing
> data throughout the event correlation scheme. Switching the roles in the
> middle of the scheme will complicate things and make them hard to
> understand later. I modified your ruleset a bit and was able to lessen
> the number of rules somewhat:
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = add NFS:$1:$3:$4 $0
>
> type = PairWithWindow
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): not responding
> context = !NFS-Alert:$1:$3:$4
> desc = NFS problem between client $1 and server $3 on filesystem $4
> action = create NFS-Alert:$1:$3:$4; \
> report NFS:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" risto
> ptype2 = regexp
> continue2 = takenext
> pattern2 = \s$1 kernel: (newnfs|nfs) server $3:$4: is alive again
> desc2 = NFS working between client %1 and server %3 on filesystem %4
> action2 = delete NFS:%1:%3:%4
> window = 30
>
> type = Single
> ptype = regexp
> continue = takenext
> pattern = ([^ ]+) kernel: (newnfs|nfs) server ([^:]+):([^:]+): is alive again
> context = NFS-Alert:$1:$3:$4
> desc = NFS now working between client $1 and server $3 on filesystem $4
> action = add NFS:$1:$3:$4 $0; \
> report NFS:$1:$3:$4 /usr/bin/mail -s "NFS working on $1" risto; \
> delete NFS:$1:$3:$4; delete NFS-Alert:$1:$3:$4
>
>
> Note that the second rule removes the data context if alarm is not
> issued (we need this kind of garbage collection, because otherwise each
> very short NFS timeout will leave behind a context with infinite
> lifetime). Also, the third rule which issues AllClear-message will drop
> both the data and alarm-flag context.
Hey, thanks for the help. That definitely cleans things up quite a bit.
For the sake of the list, here is an updated ruleset. I tweaked the
regexps to handle both FreeBSD and Linux kernel messages about NFS. I
also added a threshold rule that counts the problems that don't get
reported and generates an alert if 5 show up within an hour. Lastly, I
made the reporting a bit more concise by shortening the event descriptions.
Thanks to all for the advice and direction!
## Collect all NFS problems into a context for each
## client/server/filesystem group.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding
desc = NFS problem between $1 and $3:$4
action = add NFS:$1:$3:$4 $0
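
For anyone who wants to sanity-check the tweaked regexp outside of SEC, here is a rough Python sketch. It assumes Python's re module treats this particular pattern the same way Perl does, and the sample log lines are hypothetical, constructed to fit the pattern rather than copied from a real syslog. The context_name helper is only illustrative; it mimics how the rule builds the NFS:$1:$3:$4 context name.

```python
import re

# Same pattern as the "not responding" rules in this ruleset.
NOT_RESPONDING = re.compile(
    r"([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding"
)

def context_name(line):
    """Return the NFS:<client>:<server>:<filesystem> context name, or None."""
    m = NOT_RESPONDING.search(line)
    if m is None:
        return None
    client, _proto, server, filesystem = m.groups()
    return f"NFS:{client}:{server}:{filesystem}"

# Hypothetical sample lines (assumed formats, not real log output).
samples = [
    "Feb 20 07:40:01 client1 kernel: newnfs server filer1:/export/home: not responding",
    "Feb 20 07:40:01 client1 kernel: nfs: server filer1:/export/home not responding",
    "Feb 20 07:40:31 client1 kernel: newnfs server filer1:/export/home: is alive again",
]

for line in samples:
    print(context_name(line))
```

Both "not responding" variants should map to the same context name, and the "is alive again" line should not match this pattern at all.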
## If NFS doesn't start working again for a client/server/filesystem
## within the window, then generate an alert with all the messages
## so far as the body of the email. Create an NFS-Alert context to
## remember that we generated the alert.
type = PairWithWindow
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? not responding
context = !NFS-Alert:$1:$3:$4
desc = NFS problem alarm between $1 and $3:$4
action = create NFS-Alert:$1:$3:$4; \
report NFS:$1:$3:$4 /usr/bin/mail -s "NFS extended problems on $1" rand
ptype2 = regexp
continue2 = takenext
pattern2 = \s($1) kernel: ($2):? server ($3):($4):? (is alive again|OK)
desc2 = NFS brief problem between $1 and $3:$4
action2 = event; delete NFS:$1:$3:$4
window = 30
## NFS is working again, so if we generated an alert earlier,
## send an all-clear alert. Then clean up by deleting the contexts.
type = Single
ptype = regexp
continue = takenext
pattern = ([^ ]+) kernel: (newnfs|nfs):? server ([^:]+):([^:]+):? (is alive again|OK)
context = NFS-Alert:$1:$3:$4
desc = NFS now working between $1 and $3:$4
action = add NFS:$1:$3:$4 $0; \
report NFS:$1:$3:$4 /usr/bin/mail -s "NFS now working on $1" rand; \
delete NFS:$1:$3:$4; \
delete NFS-Alert:$1:$3:$4
## Keep a count of the unreported NFS problems. If we reach
## the threshold within the window, generate an alert.
type = SingleWithThreshold
ptype = regexp
pattern = NFS brief problem between ([^ ]+) and ([^:]+):([^ ]+)
desc = NFS 5 problems between $1 and $2:$3 in the last hour
action = pipe '%s' /usr/bin/mail -s "NFS unstable on $1" rand
window = 3600
thresh = 5
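
To illustrate what the window/thresh pair in the last rule is meant to do, here is a simplified Python model of the counting. It is only a sketch under my own assumptions and not how SEC implements SingleWithThreshold internally: the function name threshold_alerts is made up, and it resets the count right after an alert instead of waiting for the window to end.

```python
from collections import deque

def threshold_alerts(timestamps, window=3600, thresh=5):
    """Return the times at which `thresh` events have been seen within `window` seconds.

    Simplified model: old events slide out of the window, and the
    count is cleared as soon as an alert fires.
    """
    recent = deque()
    alerts = []
    for t in timestamps:
        recent.append(t)
        # Drop events that are no longer inside the window ending at t.
        while recent and t - recent[0] >= window:
            recent.popleft()
        if len(recent) >= thresh:
            alerts.append(t)
            recent.clear()
    return alerts

# Five brief problems within 40 minutes: one alert at the fifth event.
print(threshold_alerts([0, 600, 1200, 1800, 2400]))
# Five brief problems spread over more than an hour: no alert.
print(threshold_alerts([0, 1000, 2000, 3000, 4000]))
```

Timestamps are plain seconds here; in the ruleset, the events being counted are the "NFS brief problem" events generated by action2 of the PairWithWindow rule.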
_______________________________________________
Simple-evcorr-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/simple-evcorr-users