Jim,

Here's the patch of all my current changes. This is relative to the patch I sent a month ago. I'm also including a patch to the mon manpage to document the new features I've added (including the new authentication type 'trustlocal' that I added in the last patch).

Let me know if you have any questions or comments. I'm going to start working on the per-host status tracking in the next week or so, but I thought you might want to get 0.99.3 out the door before you start looking at integrating the major rewrite I'll have to do for that.

-David Nolan
Network Software Developer
Computing Services
Carnegie Mellon University
Index: mon.8 =================================================================== RCS file: /afs/andrew/system/cvs/src/netsage/mon/doc/mon.8,v retrieving revision 1.1 retrieving revision 1.2 diff -c -r1.1 -r1.2 *** mon.8 2002/09/26 12:33:11 1.1 --- mon.8 2002/09/26 12:37:12 1.2 *************** *** 22,27 **** --- 22,29 ---- .RB [ \-k .IR num ] .RB [ \-l + .IR [ statetype ] ] + .RB [ \-L .IR dir ] .RB [ \-m .IR num ] *************** *** 105,113 **** entries. Defaults to 100. .TP ! .BI \-l ! Load state from the last saved state file. Currently the only ! supported saved state is disabled watches, services, and hosts. .TP .BI \-L\ dir Sets the log dir. See also --- 107,125 ---- entries. Defaults to 100. .TP ! .BI \-l\ statetype ! Load state from the last saved state file. The ! supported saved state types are ! .B disabled ! for disabled watches, services, and hosts, ! .B opstatus ! for failure/alert/ack status of ! all services, ! and ! .B all ! for both. If no statetype is provided, ! .B disabled ! is assumed. .TP .BI \-L\ dir Sets the log dir. See also *************** *** 346,352 **** .B dep_behavior is set to .IR "'a'" , ! and a parent dependency is failing, then suppress the alert. If the alert has previously been acknowledged, do not send the alert, unless it is an upalert. If an alert is not within the specified period, record the failure --- 358,366 ---- .B dep_behavior is set to .IR "'a'" , ! or ! .B alertdepend ! is set, and a parent dependency is failing, then suppress the alert. If the alert has previously been acknowledged, do not send the alert, unless it is an upalert. If an alert is not within the specified period, record the failure *************** *** 631,636 **** --- 645,659 ---- .B passwd service will be used. + If + .I type + is + .BR trustlocal , + then if the client connection comes from localhost, the username passed from + the client will be trusted, and the password will be ignored.
This can be used + when you want the client to handle the authentication for you. I.e. a CGI script + using one of the many apache authentication methods. + .TP .BI "userfile = " file This file is used when *************** *** 817,829 **** The default limit is 10. .TP ! .BI "dep_behavior = " {a|m} .B dep_behavior controls whether the dependency expression ! suppresses either the running of alerts or monitors ! when a node in the dependency graph fails. Read more ! about the behavior in the "Service Definitions" section ! below. This is a global setting which controls the default settings for the service-specified variable. --- 840,852 ---- The default limit is 10. .TP ! .BI "dep_behavior = " {a|m|hm} .B dep_behavior controls whether the dependency expression ! suppresses one of: the running of alerts, the running of ! monitors, or the passing of individual hosts to the monitors. ! Read more about the behavior in the "Service Definitions" ! section below. This is a global setting which controls the default settings for the service-specified variable. *************** *** 1023,1032 **** the machine being ping-reachable. .TP ! .BI dep_behavior " {a|m}" ! The evaluation of dependency graphs can control the ! suppression of either alert or monitor invocations. .BR "Alert suppression" . If this option is set to "a", --- 1046,1058 ---- the machine being ping-reachable. .TP ! .BI dep_behavior " {a|m|hm}" ! The evaluation of the dependency graphs specified via the ! .B depend ! keyword can control the ! suppression of alert or monitor invocations, or the suppression ! of individual hosts passed to the monitor. .BR "Alert suppression" . If this option is set to "a", *************** *** 1047,1052 **** --- 1073,1111 ---- will be run. Otherwise, the monitor will not be run and the status of the service will remain the same. + + .BR "Host suppression" . + If it is set to "hm" then Mon will extract the list of "parent" + services from the dependency expression. 
(In fact the expression can + be just a list of services.) Then when the monitor for the service is + about to be run, for each host in the current hostgroup Mon will + search all the parent services which are currently failing and look + for the hostname in the current summary output. If the hostname is + found, this host will be excluded from this run of the monitor. This + can be used to e.g. allow an SMTP test on a group of hosts to still be run + even when a single host is not ping-reachable. If all the rest of the + hosts are working fine, the service will be in an OK state, but if + another host fails the SMTP test Mon can still alert about that host + even though the parent dependency was failing. The dependency + expression will + .B not + be used recursively in this case. + + .TP + .BI alertdepend " dependexpression" + .TP + .BI monitordepend " dependexpression" + .TP + .BI hostdepend " dependexpression" + These keywords allow you to specify multiple dependency expressions of + different types. Each one corresponds to the different + .B dep_behavior + settings listed above. They will be evaluated independently in the different + contexts as listed above. If + .B depend + is present, it takes precedence over the matching keyword, depending on the + .B dep_behavior + setting. .SS "Period Definitions"
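As a rough illustration of the host suppression described in the manpage changes above, a configuration along the following lines might be used. This is only a sketch under stated assumptions: the group name, hostnames, intervals, and alert address are invented for the example, the monitor and alert scripts are assumed to be installed, and the hostdepend keyword is available only with this patch applied.

```
# Hypothetical example; all names and addresses are placeholders.
hostgroup mailservers mail1.example.com mail2.example.com mail3.example.com

watch mailservers
    service ping
        interval 5m
        monitor fping.monitor
        period wd {Sun-Sat}
            alert mail.alert admin@example.com

    service smtp
        interval 10m
        monitor smtp.monitor
        # Skip only the hosts named in the summary of a failing parent
        hostdepend SELF:ping
        period wd {Sun-Sat}
            alert mail.alert admin@example.com
```

Assuming the summary line of the failing ping service names mail2.example.com, the smtp monitor would be run against the other two hosts only, so a separate SMTP failure on one of them can still raise an alert.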
Index: mon =================================================================== RCS file: /afs/andrew/system/cvs/src/netsage/mon/bin/mon,v retrieving revision 1.9 diff -c -r1.9 mon *** mon 2002/08/19 19:09:44 1.9 --- mon 2002/09/26 12:18:05 *************** *** 65,70 **** --- 65,71 ---- sub debug; sub debug_dir; sub dep_ok; + sub dep_summary; sub depend; sub dhmstos; sub die_die; *************** *** 198,204 **** # # argument parsing # ! getopts ("fhlMSvda:A:b:B:c:D:i:L:m:O:o:p:P:r:s:t:", \%opt); # # these two things can be taken care of without --- 199,205 ---- # # argument parsing # ! getopts ("fhMSvda:A:b:B:c:D:i:l:L:m:O:o:p:P:r:s:t:", \%opt); # # these two things can be taken care of without *************** *** 343,350 **** # # load previously saved state # ! load_state ("disabled") if ($opt{"l"}); syslog ('info', "mon server started"); # --- 344,362 ---- # # load previously saved state # ! if (exists $opt{"l"}) { ! if ($opt{"l"}) { ! # If -l was given an argument (all, disabled, opstatus, etc...) ! # pass that to load_state ! load_state($opt{"l"}); ! }else{ ! # Otherwise default to old behavior of just loading disabled hosts/services/groups ! load_state("disabled"); ! } ! } ! + syslog ('info', "mon server started"); # *************** *** 369,375 **** # # skip over disabled watch # ! next if ($watch_disabled{$group} == 1); foreach my $service (keys %{$watch{$group}}) { --- 381,387 ---- # # skip over disabled watch # ! next if (exists $watch_disabled{$group} && $watch_disabled{$group} == 1); foreach my $service (keys %{$watch{$group}}) { *************** *** 384,390 **** if ($sref->{"traptimeout"}) { $sref->{"_trap_timer"} -= $t; ! if ($sref->{"_trap_timer"} <= 0 && $tm - $sref->{"_last_uptrap"} > $sref->{"traptimeout"}) { $sref->{"_trap_timer"} = $sref->{"traptimeout"}; handle_trap_timeout ($group, $service); --- 396,402 ---- if ($sref->{"traptimeout"}) { $sref->{"_trap_timer"} -= $t; ! 
if ($sref->{"_trap_timer"} <= 0 && $tm - $sref->{"_last_trap"} > $sref->{"traptimeout"}) { $sref->{"_trap_timer"} = $sref->{"traptimeout"}; handle_trap_timeout ($group, $service); *************** *** 411,426 **** { if (!$CF{"MAXPROCS"} || $procs < $CF{"MAXPROCS"}) { ! if ($sref->{"exclude_period"} ne "" && ! inPeriod (time, $sref->{"exclude_period"})) { debug (1, "not running $group,$service because of exclude_period\n"); } ! elsif ($sref->{"dep_behavior"} eq "m" && ! $sref->{"depend"} ne "") { ! if (dep_ok ($sref)) { run_monitor ($group, $service); } --- 423,440 ---- { if (!$CF{"MAXPROCS"} || $procs < $CF{"MAXPROCS"}) { ! if (defined $sref->{"exclude_period"} ! && $sref->{"exclude_period"} ne "" && ! inPeriod (time, $sref->{"exclude_period"})) { debug (1, "not running $group,$service because of exclude_period\n"); } ! elsif (($sref->{"dep_behavior"} eq "m" && ! defined $sref->{"depend"} && $sref->{"depend"} ne "") ! || (defined $sref->{"monitordepend"} && $sref->{"monitordepend"} ne "")) { ! if (dep_ok ($sref, 'm')) { run_monitor ($group, $service); } *************** *** 530,536 **** # # if the alarm is disabled, ignore it # ! if ($sref->{"disable"} == 1) { syslog ("notice", "ignoring alert for $group,$service"); return; --- 544,550 ---- # # if the alarm is disabled, ignore it # ! if (defined $sref->{"disable"} && $sref->{"disable"} == 1) { syslog ("notice", "ignoring alert for $group,$service"); return; *************** *** 540,548 **** # dependency check # if (!($flags & $FL_STARTUPALERT) && ! !($flags & $FL_UPALERT) && ! defined $sref->{"depend"} && ! $sref->{"dep_behavior"} eq "a") { if (!$sref->{"_depend_status"}) { --- 554,562 ---- # dependency check # if (!($flags & $FL_STARTUPALERT) && ! !($flags & $FL_UPALERT) && ! ((defined $sref->{"depend"} && $sref->{"dep_behavior"} eq "a") ! || (defined $sref->{"alertdepend"}))) { if (!$sref->{"_depend_status"}) { *************** *** 562,568 **** } my ($summary) = split("\n", $output); ! 
$summary = "(NO SUMMARY)" if ($summary =~ /^\s*$/m); # # check each time period for pending alerts --- 576,582 ---- } my ($summary) = split("\n", $output); ! $summary = "(NO SUMMARY)" if (!defined $summary || $summary =~ /^\s*$/m); # # check each time period for pending alerts *************** *** 1002,1008 **** $new_CF{"DEP_RECUR_LIMIT"} = $2; } elsif ($1 eq "dep_behavior") { ! if ($2 ne "m" && $2 ne "a") { close (CFG); return "cf error: unknown dependency behavior '$2', line $line_num"; } --- 1016,1022 ---- $new_CF{"DEP_RECUR_LIMIT"} = $2; } elsif ($1 eq "dep_behavior") { ! if ($2 ne "m" && $2 ne "a" && $2 ne "hm") { close (CFG); return "cf error: unknown dependency behavior '$2', line $line_num"; } *************** *** 1208,1213 **** --- 1222,1228 ---- $sref->{"_last_failure"} = 0 if (!defined($sref->{"_last_failure"})); $sref->{"_last_success"} = 0 if (!defined($sref->{"_last_success"})); $sref->{"_last_trap"} = 0 if (!defined($sref->{"_last_trap"})); + $sref->{"_last_traphost"} = '' if (!defined($sref->{"_last_traphost"})); $sref->{"_exitval"} = "undef" if (!defined($sref->{"_exitval"})); $sref->{"_last_check"} = undef; $sref->{"_depend_status"} = undef; *************** *** 1472,1478 **** elsif ($var eq "dep_behavior") { !
if ($args ne "m" && $args ne "a" && $args ne "hm") { close (CFG); return "cf error: unknown dependency behavior '$args' (syntax: dep_behavior = {m|a}), line $line_num"; *************** *** 1484,1489 **** --- 1499,1519 ---- $args =~ s/SELF:/$watchgroup:/g; } + elsif ($var eq "alertdepend") + { + $args =~ s/SELF:/$watchgroup:/g; + } + + elsif ($var eq "monitordepend") + { + $args =~ s/SELF:/$watchgroup:/g; + } + + elsif ($var eq "hostdepend") + { + $args =~ s/SELF:/$watchgroup:/g; + } + elsif ($var eq "exclude_hosts") { my $ex = {}; *************** *** 1594,1600 **** } $procs = 0; ! syslog ('info', "resetting, and re-reading configuration $CF{CF}"); if ((my $err = read_cf ($CF{"CF"}, 1)) ne "") { --- 1624,1630 ---- } $procs = 0; ! save_state ("all") if ($keepstate); syslog ('info', "resetting, and re-reading configuration $CF{CF}"); if ((my $err = read_cf ($CF{"CF"}, 1)) ne "") { *************** *** 1608,1614 **** $fdset_rbits = $fdset_ebits = ''; set_last_test (); randomize_startdelay() if ($CF{"RANDSTART"}); ! load_state ("disabled") if ($keepstate); if ($CF{"DTLOGGING"}) { init_dtlog(); } --- 1638,1644 ---- $fdset_rbits = $fdset_ebits = ''; set_last_test (); randomize_startdelay() if ($CF{"RANDSTART"}); ! load_state ("all") if ($keepstate); if ($CF{"DTLOGGING"}) { init_dtlog(); } *************** *** 1678,1684 **** my ($fl); $fl = ''; ! fcntl ($fh, F_GETFL, $fl) || return; $fl |= O_NONBLOCK; fcntl ($fh, F_SETFL, $fl) || return; --- 1708,1714 ---- my ($fl); $fl = ''; ! $fl = fcntl ($fh, F_GETFL, $fl) || return; $fl |= O_NONBLOCK; fcntl ($fh, F_SETFL, $fl) || return; *************** *** 2193,2199 **** # list status of all services # } elsif ($cmd eq "opstatus") { ! if ($args eq "") { foreach $group (keys %watch) { foreach $service (keys %{$watch{$group}}) { --- 2223,2229 ---- # list status of all services # } elsif ($cmd eq "opstatus") { ! 
if (!defined $args || $args eq "") { foreach $group (keys %watch) { foreach $service (keys %{$watch{$group}}) { *************** *** 2243,2253 **** } } foreach $group (keys %watch) { ! if ($watch_disabled{$group} == 1) { sock_write ($fh, "watch $group\n"); } foreach $service (keys %{$watch{$group}}) { ! if ($watch{$group}->{$service}->{'disable'} == 1) { sock_write ($fh, "watch $group service " . "$service\n"); } --- 2273,2284 ---- } } foreach $group (keys %watch) { ! if (exists $watch_disabled{$group} && $watch_disabled{$group} == 1) { sock_write ($fh, "watch $group\n"); } foreach $service (keys %{$watch{$group}}) { ! if (defined $watch{$group}->{$service}->{'disable'} ! && $watch{$group}->{$service}->{'disable'} == 1) { sock_write ($fh, "watch $group service " . "$service\n"); } *************** *** 2583,2589 **** # check auth # } elsif ($cmd eq "checkauth") { ! split(' ',$args); $cmd = $_[0]; $user = $clients{$cl}->{"user"}; # Note that we call check_auth without syslogging here. --- 2614,2620 ---- # check auth # } elsif ($cmd eq "checkauth") { ! @_ = split(' ',$args); $cmd = $_[0]; $user = $clients{$cl}->{"user"}; # Note that we call check_auth without syslogging here. *************** *** 2614,2619 **** --- 2645,2653 ---- my $summary = esc_str ($sref->{"_last_summary"}, 1); my $detail = esc_str ($sref->{"_last_detail"}, 1); my $depend = esc_str ($sref->{"depend"}, 1); + my $hostdepend = esc_str ($sref->{"hostdepend"}, 1); + my $monitordepend = esc_str ($sref->{"monitordepend"}, 1); + my $alertdepend = esc_str ($sref->{"alertdepend"}, 1); my $monitor = esc_str ($sref->{"monitor"}, 1); my $comment; *************** *** 2629,2652 **** $alerts_sent += $sref->{"periods"}->{$period}->{"_alert_sent"}; } ! my $buf = ! "group=$group" . ! " service=$service" . ! " opstatus=$sref->{_op_status}" . ! " last_opstatus=$sref->{_last_op_status}" . ! " exitval=$sref->{_exitval}" . ! " timer=$sref->{_timer}" . ! " last_success=$sref->{_last_success}" . ! 
" last_trap=$sref->{_last_trap}" . ! " last_check=$sref->{_last_check}" . ! " ack=$sref->{_ack}" . ! " ackcomment='$comment'" . ! " alerts_sent=$alerts_sent" . ! " depstatus=" . int ($sref->{"_depend_status"}) . ! " depend='$depend'" . ! " monitor='$monitor'" . ! " last_summary='$summary'" . ! " last_detail='$detail'"; $buf .= " last_failure=$sref->{_last_failure}" if ($sref->{"_last_failure"}); --- 2663,2687 ---- $alerts_sent += $sref->{"periods"}->{$period}->{"_alert_sent"}; } ! my $buf = "group=$group service=$service opstatus=$sref->{_op_status}"; ! $buf .= " last_opstatus=" . (defined $sref->{_last_op_status} ? $sref->{_last_op_status} : ""); ! $buf .= " exitval=" . (defined $sref->{_exitval} ? $sref->{_exitval} : ""); ! $buf .= " timer=" . (defined $sref->{_timer} ? $sref->{_timer} : ""); ! $buf .= " last_success=" . (defined $sref->{_last_success} ? $sref->{_last_success} : ""); ! $buf .= " last_trap=" . (defined $sref->{_last_trap} ? $sref->{_last_trap} : ""); ! $buf .= " last_traphost=" . (defined $sref->{_last_traphost} ? $sref->{_last_traphost} : ""); ! $buf .= " last_check=" . (defined $sref->{_last_check} ? $sref->{_last_check} : ""); ! $buf .= " ack=" . (defined $sref->{_ack} ? $sref->{_ack} : ""); ! $buf .= " ackcomment='$comment'"; ! $buf .= " alerts_sent=$alerts_sent"; ! $buf .= " depstatus=" . (defined $sref->{"_depend_status"} ? int ($sref->{"_depend_status"}) : ""); ! $buf .= " depend='$depend'"; ! $buf .= " hostdepend='$hostdepend'"; ! $buf .= " monitordepend='$monitordepend'"; ! $buf .= " alertdepend='$alertdepend'"; ! $buf .= " monitor='$monitor'"; ! $buf .= " last_summary='$summary'"; ! 
$buf .= " last_detail='$detail'"; $buf .= " last_failure=$sref->{_last_failure}" if ($sref->{"_last_failure"}); *************** *** 2763,2768 **** --- 2798,2804 ---- exit (1); } + print N "Mon starting at ".localtime(time)."\n"; if (!open(STDOUT, ">&N") || !open (STDIN, "<&N") || !open (STDERR, ">&N")) { *************** *** 2779,2785 **** sub debug { my ($level, @l) = @_; ! return if ($level > $opt{"d"}); if ($opt{"d"} && !$opt{"f"}) { print STDERR @l; --- 2815,2821 ---- sub debug { my ($level, @l) = @_; ! return if (!defined $opt{"d"} || $level > $opt{"d"}); if ($opt{"d"} && !$opt{"f"}) { print STDERR @l; *************** *** 2832,2841 **** $sref->{"_last_checked"} = $tmnow; ! if ($sref->{"depend"} ne "" && ! $sref->{"dep_behavior"} eq "a") { ! dep_ok ($sref); } # --- 2868,2878 ---- $sref->{"_last_checked"} = $tmnow; ! if ((defined $sref->{"depend"} && $sref->{"depend"} ne "" && ! $sref->{"dep_behavior"} eq "a") ! || (defined $sref->{"alertdepend"} && $sref->{"alertdepend"} ne "")) { ! dep_ok ($sref, 'a'); } # *************** *** 2874,2880 **** # change interval if needed # if (defined ($sref->{"failure_interval"}) && ! $sref->{"_old_interval"} == undef) { $sref->{"_old_interval"} = $sref->{"interval"}; $sref->{"interval"} = $sref->{"failure_interval"}; --- 2911,2917 ---- # change interval if needed # if (defined ($sref->{"failure_interval"}) && ! !defined $sref->{"_old_interval"}) { $sref->{"_old_interval"} = $sref->{"interval"}; $sref->{"interval"} = $sref->{"failure_interval"}; *************** *** 2935,2941 **** # change interval back to original # if (defined ($sref->{"failure_interval"}) && ! $sref->{"_old_interval"} != undef) { $sref->{"interval"} = $sref->{"_old_interval"}; $sref->{"_old_interval"} = undef; --- 2972,2978 ---- # change interval back to original # if (defined ($sref->{"failure_interval"}) && ! 
defined $sref->{"_old_interval"}) { $sref->{"interval"} = $sref->{"_old_interval"}; $sref->{"_old_interval"} = undef; *************** *** 3069,3074 **** --- 3106,3129 ---- @ghosts = @g; } + # + # per-host dependencies + # + if ((defined $sref->{"depend"} && $sref->{"depend"} ne "" && $sref->{"dep_behavior"} eq 'hm') + || (defined $sref->{"hostdepend"} && $sref->{"hostdepend"} ne "")) + { + my @g = (); + my $sum = dep_summary($sref); + + for (my $i=0; $i<@ghosts; $i++) + { + push (@g, $ghosts[$i]) + if (! grep /\Q$ghosts[$i]\E/, @$sum); + } + + @ghosts = @g; + } + @args = (quotewords ('\s+', 0, $monitor), @ghosts); } *************** *** 3094,3105 **** foreach $v (keys %{$sref->{"ENV"}}) { $ENV{$v} = $sref->{"ENV"}->{$v}; } ! $ENV{"MON_LAST_SUMMARY"} = $sref->{"_last_summary"}; ! $ENV{"MON_LAST_OUTPUT"} = $sref->{"_last_output"}; ! $ENV{"MON_LAST_FAILURE"} = $sref->{"_last_failure"}; ! $ENV{"MON_FIRST_FAILURE"} = $sref->{"_first_failure"}; ! $ENV{"MON_DEPEND_STATUS"} = $sref->{"_depend_status"}; ! $ENV{"MON_LAST_SUCCESS"} = $sref->{"_last_success"}; $ENV{"MON_STATEDIR"} = $CF{"STATEDIR"}; $ENV{"MON_LOGDIR"} = $CF{"LOGDIR"}; exec @args or syslog ('err', "could not exec '@args': $!") --- 3149,3160 ---- foreach $v (keys %{$sref->{"ENV"}}) { $ENV{$v} = $sref->{"ENV"}->{$v}; } ! $ENV{"MON_LAST_SUMMARY"} = $sref->{"_last_summary"} if (defined $sref->{"_last_summary"}); ! $ENV{"MON_LAST_OUTPUT"} = $sref->{"_last_output"} if (defined $sref->{"_last_output"}); ! $ENV{"MON_LAST_FAILURE"} = $sref->{"_last_failure"} if (defined $sref->{"_last_failure"}); ! $ENV{"MON_FIRST_FAILURE"} = $sref->{"_first_failure"} if (defined $sref->{"_first_failure"}); ! $ENV{"MON_DEPEND_STATUS"} = $sref->{"_depend_status"} if (defined $sref->{"_depend_status"}); !
$ENV{"MON_LAST_SUCCESS"} = $sref->{"_last_success"} if (defined $sref->{"_last_success"}); $ENV{"MON_STATEDIR"} = $CF{"STATEDIR"}; $ENV{"MON_LOGDIR"} = $CF{"LOGDIR"}; exec @args or syslog ('err', "could not exec '@args': $!") *************** *** 3248,3254 **** my $found = undef; foreach my $g (keys %groups) { ! if ($cmd == 0) { if (grep (s/^$h$/*$h/, @{$groups{$g}})) { $found = 1; --- 3303,3309 ---- my $found = undef; foreach my $g (keys %groups) { ! if ((!defined $cmd) || $cmd == 0) { if (grep (s/^$h$/*$h/, @{$groups{$g}})) { $found = 1; *************** *** 3306,3316 **** } } foreach $group (keys %watch) { ! if ($watch_disabled{$group} == 1) { print STATE "disable watch $group\n"; } foreach $service (keys %{$watch{$group}}) { ! if ($watch{$group}->{$service}->{'disable'} == 1) { print STATE "disable service $group $service\n"; } } --- 3361,3372 ---- } } foreach $group (keys %watch) { ! if (exists $watch_disabled{$group} && $watch_disabled{$group} == 1) { print STATE "disable watch $group\n"; } foreach $service (keys %{$watch{$group}}) { ! if (defined $watch{$group}->{$service}->{'disable'} ! && $watch{$group}->{$service}->{'disable'} == 1) { print STATE "disable service $group $service\n"; } } *************** *** 3324,3333 **** } foreach $group (keys %watch) { foreach $service (keys %{$watch{$group}}) { ! print STATE "group=$group service=$service" . ! " op_status=$watch{$group}->{$service}->{_op_status}" . ! " failure_count=$watch{$group}->{$service}->{_failure_count}" . ! " alert_count=\n"; } } close (STATE); --- 3380,3398 ---- } foreach $group (keys %watch) { foreach $service (keys %{$watch{$group}}) { ! print STATE "group=$group\tservice=$service"; ! foreach my $var (qw(op_status failure_count alert_count last_success ! consec_failures last_failure first_failure last_summary ! last_detail ack ack_comment last_trap last_traphost exitval ! last_check last_op_status)) { ! print STATE "\t$var=" . esc_str($watch{$group}->{$service}->{"_$var"}); ! } ! 
foreach my $periodlabel (keys %{$watch{$group}->{$service}->{periods}}) { ! foreach my $var (qw(last_alert alert_sent 1stfailtime failcount)) { ! print STATE "\t$periodlabel:$var=" . esc_str($watch{$group}->{$service}{periods}{$periodlabel}{"_$var"}); ! } ! } ! print STATE "\n"; } } close (STATE); *************** *** 3344,3350 **** my ($l, $cmd, $args, $group, $service, $what, $state); foreach $state (@states) { ! if ($state eq "disabled") { if (!open (STATE, "$CF{STATEDIR}/disabled")) { syslog ("err", "could not read state file: $!"); next; --- 3409,3415 ---- my ($l, $cmd, $args, $group, $service, $what, $state); foreach $state (@states) { ! if ($state eq "disabled" || $state eq "all") { if (!open (STATE, "$CF{STATEDIR}/disabled")) { syslog ("err", "could not read state file: $!"); next; *************** *** 3372,3377 **** --- 3437,3468 ---- syslog ("info", "state '$state' loaded"); close (STATE); } + + if ($state eq "opstatus" || $state eq "all") { + if (!open (STATE, "$CF{STATEDIR}/opstatus")) { + syslog ("err", "could not read state file: $!"); + next; + } + + while (defined ($l = <STATE>)) { + chomp $l; + my %opstatus = map{ /^(.*)=(.*)$/; $1 => $2} split (/\t/, $l,); + next unless (exists $opstatus{group} && exists $watch{$opstatus{group}} + && exists $opstatus{service} && exists $watch{$opstatus{group}}->{$opstatus{service}}); + + foreach my $op (keys %opstatus) { + next if ($op eq 'group' || $op eq 'service'); + if ($op =~ /^(.*):(.*)$/) { + next unless exists $watch{$opstatus{group}}->{$opstatus{service}}{periods}{$1}; + $watch{$opstatus{group}}->{$opstatus{service}}{periods}{$1}{"_$2"} = un_esc_str($opstatus{$op}); + } else { + $watch{$opstatus{group}}->{$opstatus{service}}{"_$op"} = un_esc_str($opstatus{$op}); + } + } + } + syslog ("info", "state '$state' loaded"); + close (STATE); + } } } *************** *** 3519,3531 **** # allow traps from all hosts # ! } elsif ($host =~ /^[a-z]/) { !
if (($host = inet_aton ($host)) eq "") { syslog ('err', "invalid host in $CF{AUTHFILE}, line $."); next; } ! } elsif ($host =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/) { ! if (($host = inet_aton ($host)) eq "") { syslog ('err', "invalid host in $CF{AUTHFILE}, line $."); next; } --- 3610,3622 ---- # allow traps from all hosts # ! } elsif ($host =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/) { ! if (($host = inet_aton ($host)) eq "") { syslog ('err', "invalid host in $CF{AUTHFILE}, line $."); next; } ! } elsif ($host =~ /^[A-Z\d][[A-Z\.\d\-]*[[A-Z\d]+$/i) { ! if (($host = inet_aton ($host)) eq "") { syslog ('err', "invalid host in $CF{AUTHFILE}, line $."); next; } *************** *** 3539,3544 **** --- 3630,3636 ---- $host = inet_ntoa ($host); } + syslog ('notice', "Adding trap auth of: $host $user $password"); $AUTHTRAPS{$host}{$user} = $password; } elsif ($sect eq "snmptrap") { *************** *** 3793,3801 **** else { ! $traphost = $addr; } if (defined ($AUTHTRAPS{$traphost}{"*"})) { $trapuser = "*"; --- 3885,3899 ---- else { ! $traphost = $fromip; } + if (!defined ($AUTHTRAPS{$traphost})) + { + syslog ('err', "received trap from unauthorized host: $fromip"); + return undef; + } + if (defined ($AUTHTRAPS{$traphost}{"*"})) { $trapuser = "*"; *************** *** 3808,3825 **** $trappass = $trap{"pas"}; } ! if (!defined ($AUTHTRAPS{$traphost})) ! { ! syslog ('err', "received trap from unauthorized host: $fromip"); ! return undef; ! } ! ! if ($trapuser ne "*" && crypt ($trappass, $AUTHTRAPS{$traphost}{$trapuser}) ne ! $AUTHTRAPS{$traphost}{$trapuser}) ! { ! syslog ('err', "received trap from unauthorized user $trapuser, host $traphost"); ! return undef; } # --- 3906,3919 ---- $trappass = $trap{"pas"}; } ! if ($trapuser ne "*") { ! if (!defined $AUTHTRAPS{$traphost}{$trapuser} || crypt ($trappass, $AUTHTRAPS{$traphost}{$trapuser}) ne ! $AUTHTRAPS{$traphost}{$trapuser}) ! { ! syslog ('err', "received trap from unauthorized user $trapuser, host $traphost"); ! return undef; ! 
} } # *************** *** 3876,3881 **** --- 3970,3976 ---- $sref->{"_last_trap"} = $time; $sref->{"_last_detail"} = $trap{"dtl"}; $sref->{"_last_summary"} = $trap{"sum"}; + $sref->{"_last_traphost"} = $fromip; if ($intended) { *************** *** 3884,3890 **** my $old_status = $sref->{"_op_status"}; ! syslog ('info', "trap $trap{typ} $trap{spc} from " . "$fromip for $trap{grp} $trap{svc}, status $trap{sta}"); my $group = $trap{"grp"}; --- 3979,3985 ---- my $old_status = $sref->{"_op_status"}; ! syslog ('debug', "trap $trap{typ} $trap{spc} from " . "$fromip for $trap{grp} $trap{svc}, status $trap{sta}"); my $group = $trap{"grp"}; *************** *** 3969,3978 **** push @last_failures, "$trap{grp} $trap{svc}" . " $tm $trap{typ} $trap{spc} $trap{sum}"; ! if ($sref->{"depend"} ne "" && ! $sref->{"dep_behavior"} eq "a") { ! dep_ok ($sref); } # --- 4064,4074 ---- push @last_failures, "$trap{grp} $trap{svc}" . " $tm $trap{typ} $trap{spc} $trap{sum}"; ! if ((defined $sref->{"depend"} && $sref->{"depend"} ne "" && ! $sref->{"dep_behavior"} eq "a") ! || (defined $sref->{"alertdepend"} && $sref->{"alertdepend"} ne "")) { ! dep_ok ($sref, 'a'); } # *************** *** 4026,4031 **** --- 4122,4128 ---- { $sref->{"periods"}->{$period}->{"_last_alert"} = 0; $sref->{"periods"}->{$period}->{"_alert_sent"} = 0; + $sref->{"periods"}->{$period}->{"_1stfailtime"} = 0; } } else { $sref->{"_failure_output"} = $trap{"sum"} . $trap{"dtl"}; *************** *** 4050,4056 **** --- 4147,4155 ---- $tmnow = time; my $sref = \%{$watch{$group}->{$service}}; + dep_ok ($sref, 'a'); $sref->{"_failure_count"}++; + $sref->{"_consec_failures"}++; $sref->{"_last_failure"} = $tmnow; $sref->{"_first_failure"} = $tmnow if ($sref->{"_op_status"} != $STAT_FAIL); set_op_status ($group, $service, $STAT_FAIL); *************** *** 4060,4066 **** push @last_failures, "$group $service $tm $sref->{_last_summary}"; syslog ('crit', "failure for $last_failures[-1]"); ! 
do_alert ($group, $service, undef, undef, $FL_TRAPTIMEOUT); } --- 4159,4165 ---- push @last_failures, "$group $service $tm $sref->{_last_summary}"; syslog ('crit', "failure for $last_failures[-1]"); ! do_alert ($group, $service, "trap timeout\n", -1, $FL_TRAPTIMEOUT); } *************** *** 4521,4535 **** $sref->{"_timer"} = $sref->{"interval"} if ($sref->{"interval"}); foreach my $period (keys %{$sref->{"periods"}}) { my $pref = \%{$sref->{"periods"}->{$period}}; $pref->{"_last_alert"} = 0 if ($pref->{"alertevery"}); - $pref->{"_consec_failures"} = 0 - if ($pref->{"alertafter_consec"}); - $pref->{'_1stfailtime'} = 0 if ($pref->{"alertafterival"}); } --- 4620,4634 ---- $sref->{"_timer"} = $sref->{"interval"} if ($sref->{"interval"}); + $sref->{"_consec_failures"} = 0 + if ($sref->{"_consec_failures"}); + foreach my $period (keys %{$sref->{"periods"}}) { my $pref = \%{$sref->{"periods"}->{$period}}; $pref->{"_last_alert"} = 0 if ($pref->{"alertevery"}); $pref->{'_1stfailtime'} = 0 if ($pref->{"alertafterival"}); } *************** *** 4597,4603 **** my $tmnow = time; my ($summary) = split("\n", $args{"output"}); ! $summary = "(NO SUMMARY)" if ($summary =~ /^\s*$/m); my $sref = \%{$watch{$args{"group"}}->{$args{"service"}}}; my $pref; --- 4696,4702 ---- my $tmnow = time; my ($summary) = split("\n", $args{"output"}); ! $summary = "(NO SUMMARY)" if (!defined $summary || $summary =~ /^\s*$/m); my $sref = \%{$watch{$args{"group"}}->{$args{"service"}}}; my $pref; *************** *** 4606,4611 **** --- 4705,4714 ---- $pref = $args{"pref"}; } + if (! defined $args{"args"}) { + $args{"args"} = ''; + } + my $alert = ""; if (!defined $ALERTHASH{$args{"alert"}} || ! -f $ALERTHASH{$args{"alert"}}) { *************** *** 4661,4676 **** $ENV{$v} = $sref->{"ENV"}->{$v}; } ! $ENV{"MON_LAST_SUMMARY"} = $sref->{"_last_summary"}; ! $ENV{"MON_LAST_OUTPUT"} = $sref->{"_last_output"}; ! $ENV{"MON_LAST_FAILURE"} = $sref->{"_last_failure"}; ! 
$ENV{"MON_FIRST_FAILURE"} = $sref->{"_first_failure"}; ! $ENV{"MON_LAST_SUCCESS"} = $sref->{"_last_success"}; ! $ENV{"MON_DESCRIPTION"} = $sref->{"description"}; ! $ENV{"MON_GROUP"} = $args{"group"}; ! $ENV{"MON_SERVICE"} = $args{"service"}; ! $ENV{"MON_RETVAL"} = $args{"retval"}; ! $ENV{"MON_OPSTATUS"} = $sref->{"_op_status"}; $ENV{"MON_ALERTTYPE"} = $alert_type; $ENV{"MON_STATEDIR"} = $CF{"STATEDIR"}; $ENV{"MON_LOGDIR"} = $CF{"LOGDIR"}; --- 4764,4779 ---- $ENV{$v} = $sref->{"ENV"}->{$v}; } ! $ENV{"MON_LAST_SUMMARY"} = $sref->{"_last_summary"} if (defined $sref->{"_last_summary"}); ! $ENV{"MON_LAST_OUTPUT"} = $sref->{"_last_output"} if (defined $sref->{"_last_output"}); ! $ENV{"MON_LAST_FAILURE"} = $sref->{"_last_failure"} if (defined $sref->{"_last_failure"}); ! $ENV{"MON_FIRST_FAILURE"} = $sref->{"_first_failure"} if (defined $sref->{"_first_failure"}); ! $ENV{"MON_LAST_SUCCESS"} = $sref->{"_last_success"} if (defined $sref->{"_last_success"}); ! $ENV{"MON_DESCRIPTION"} = $sref->{"description"} if (defined $sref->{"description"}); ! $ENV{"MON_GROUP"} = $args{"group"} if (defined $args{"group"}); ! $ENV{"MON_SERVICE"} = $args{"service"} if (defined $args{"service"}); ! $ENV{"MON_RETVAL"} = $args{"retval"} if (defined $args{"retval"}); ! $ENV{"MON_OPSTATUS"} = $sref->{"_op_status"} if (defined $sref->{"_op_status"}); $ENV{"MON_ALERTTYPE"} = $alert_type; $ENV{"MON_STATEDIR"} = $CF{"STATEDIR"}; $ENV{"MON_LOGDIR"} = $CF{"LOGDIR"}; *************** *** 4774,4780 **** # } # sub depend { ! my ($depend, $depth) = @_; debug (1, "checking DEP [$depend]\n"); if ($depth > $CF{"DEP_RECUR_LIMIT"}) { --- 4877,4883 ---- # } # sub depend { ! my ($depend, $depth, $deptype) = @_; debug (1, "checking DEP [$depend]\n"); if ($depth > $CF{"DEP_RECUR_LIMIT"}) { *************** *** 4791,4801 **** my $sref = \%{$watch{$group}->{$service}}; my $depval = undef; # # disabled watches and services are counted as "passing" # ! 
if ($watch_disabled{$group} || $sref->{"disable"} == 1) { $depval = 1; --- 4894,4912 ---- my $sref = \%{$watch{$group}->{$service}}; my $depval = undef; + my $subdepend = ""; + if (defined $sref->{"depend"} && $sref->{"dep_behavior"} eq $deptype) { + $subdepend = $sref->{"depend"}; + } elsif ($deptype eq 'a' && defined $sref->{"alertdepend"}) { + $subdepend = $sref->{"alertdepend"}; + } elsif ($deptype eq 'm' && defined $sref->{"monitordepend"}) { + $subdepend = $sref->{"monitordepend"}; + } # # disabled watches and services are counted as "passing" # ! if ((exists $watch_disabled{$group} && $watch_disabled{$group}) || (defined $sref->{"disable"} && $sref->{"disable"} == 1)) { $depval = 1; *************** *** 4803,4809 **** # root dependency found # } ! elsif ($sref->{"depend"} eq "") { debug (1, " found root dep $group,$service\n"); --- 4914,4920 ---- # root dependency found # } ! elsif ($subdepend eq "") { debug (1, " found root dep $group,$service\n"); *************** *** 4818,4824 **** # # do it recursively # ! my $dstatus = depend ($sref->{"depend"}, $depth + 1); debug (1, "recur depth $depth returned $dstatus->{status},$dstatus->{depend}\n"); --- 4929,4935 ---- # # do it recursively # ! my $dstatus = depend ($subdepend, $depth + 1, $deptype); debug (1, "recur depth $depth returned $dstatus->{status},$dstatus->{depend}\n"); *************** *** 4874,4881 **** sub dep_ok { my $sref = shift; ! my $s = depend ($sref->{"depend"}, 0); if ($s->{"status"} eq "D") { --- 4985,5003 ---- sub dep_ok { my $sref = shift; + my $deptype = shift; + my $depend = ""; + if (defined $sref->{"depend"} && $sref->{"dep_behavior"} eq $deptype) { + $depend = $sref->{"depend"}; + } elsif ($deptype eq 'a' && defined $sref->{"alertdepend"}) { + $depend = $sref->{"alertdepend"}; + } elsif ($deptype eq 'm' && defined $sref->{"monitordepend"}) { + $depend = $sref->{"monitordepend"}; + } + + return 1 unless ($depend ne ""); ! 
my $s = depend ($depend, 0, $deptype); if ($s->{"status"} eq "D") { *************** *** 4901,4906 **** --- 5023,5060 ---- # + # returns undef on error + # otherwise a reference to a list summaries from all + # DIRECT dependencies currently failing + sub dep_summary + { + my $sref = shift; + my @sum; + my @deps = (); + + if (defined $sref->{"depend"} && $sref->{"dep_behavior"} eq "hm") { + @deps = ($sref->{"depend"} =~ /[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+/g); + } elsif (defined $sref->{"hostdepend"}) { + @deps = ($sref->{"hostdepend"} =~ /[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+/g); + } + + return [] if (! @deps); + + foreach (@deps) { + my ($group, $service) = split /:/; + if (!(exists $watch{$group} && exists $watch{$group}->{$service})) { + return undef; + } + + if ($watch{$group}->{$service}{"_op_status"} == $STAT_FAIL) { + push @sum, $watch{$group}->{$service}{"_last_summary"}; + } + } + + return \@sum; + } + + # # convert a string to a hex-escaped string, returning # the escaped string. # *************** *** 4915,4921 **** my $inquotes = shift; my $escstr = ""; ! for (my $i = 0; $i < length ($str); $i++) { my $c = substr ($str, $i, 1); --- 5069,5075 ---- my $inquotes = shift; my $escstr = ""; ! return $escstr if (!defined $str); for (my $i = 0; $i < length ($str); $i++) { my $c = substr ($str, $i, 1);
I'm going to use the same basic format for these comments as in my last set: first the list of changes, then the detailed mapping of patch sections to changes.

The changes are:

1. Added support for saving/loading full opstatus information.
2. Added support for specifying which type(s) of state to load when mon is started with the -l switch.
3. Added a new dependency behavior type 'hm', for per-host monitor suppression.
4. Added the ability to have multiple dependency expressions associated with a single watch/service. This adds three new mon.cfg keywords: 'alertdepend', 'monitordepend', and 'hostdepend'.
5. Fixed some bugs in trap authentication checking where traps from any host were being allowed.
6. Fixed a couple of bugs that were preventing traptimeouts from sending alerts when a dependency or an alertafter statement was involved.
7. *Lots* of little changes to make 'perl -w' happy with mon. As a side effect of this, the memory leak problems I was having seem to have gone away.
8. Added code to track which host a trap comes from.
9. Fixed a couple of bugs where things weren't getting reset after an up trap.

And here's the per-section annotation (starting line of each patch section in the original file, followed by the change number(s) it belongs to):

65: 3
198: 2
343: 2
369: 7
384: 6
411: 4 & 7
530: 7
540: 4 & 7
562: 7
1002: 3
1208: 8
1472: 3
1484: 4
1594: 1
1608: 1
1678: 7
2193: 7
2243: 7
2583: 7 (Implicit assigning to @_ with split is deprecated)
2614: 4
2629: 4 & 7
2763: Added log entry for mon restarts
2779: 7
2832: 4 & 7
2874: 7
2935: 7
3069: 3 & 4
3094: 7
3248: 7
3306: 7
3324: 1
3344: 2
3372: 1 & 2
3519: 5
3539: 5
3793: 5
3808: 5
3876: 8
3884: Reduced the syslog logging level of the trap logging
3969: 4
4026: 9
4050: 6
4060: 6
4521: 9
4597: 7
4606: 7
4661: 7
4774: 4
4791: 4
4803: 4
4818: 4
4874: 4
4901: 3 & 4
4915: 7
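
As a rough illustration of change 4, here is a minimal standalone sketch of how a dependency expression gets picked for a given dependency type ('a' for alert suppression, 'm' for monitor suppression). It follows the selection order used in dep_ok() and depend() in the patch, but the helper name select_depend and the sample service record are made up for this example, and the extra "defined" check on dep_behavior is my own addition to keep 'perl -w' quiet; it is not part of the patch.

```perl
use strict;
use warnings;

# Return the dependency expression that applies to $deptype, or ""
# if none applies. A plain "depend" wins when dep_behavior matches
# the requested type; otherwise fall back to the type-specific keyword.
sub select_depend {
    my ($sref, $deptype) = @_;

    if (defined $sref->{"depend"}
        && defined $sref->{"dep_behavior"}    # guard added for this sketch
        && $sref->{"dep_behavior"} eq $deptype) {
        return $sref->{"depend"};
    }
    return $sref->{"alertdepend"}
        if ($deptype eq 'a' && defined $sref->{"alertdepend"});
    return $sref->{"monitordepend"}
        if ($deptype eq 'm' && defined $sref->{"monitordepend"});

    return "";
}

# Hypothetical service record, for illustration only.
my %svc = (
    "depend"        => "routers:ping",
    "dep_behavior"  => "a",
    "monitordepend" => "switches:ping",
);

print "alert:   ", select_depend(\%svc, 'a'), "\n";
print "monitor: ", select_depend(\%svc, 'm'), "\n";
```

With this record, the alert check uses the plain 'depend' expression (because dep_behavior is 'a'), while the monitor check falls through to 'monitordepend'. An empty string means there is nothing to check, which is the case where dep_ok() in the patch returns 1 immediately.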