We use MON to restart a flakey application that one
of our webservers depends on. When that application hangs, webpages
are no longer served properly.
The monitoring process goes something like
this. MON checks the password-protected website and searches
for a string that confirms that the webpage is being served properly. If
that string is missing, MON runs the related "alert", which is really just a
script that uses SSH to restart the application.
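As a rough illustration of the check described above, here is a minimal Perl sketch of the decision a content-matching monitor makes. It is not http-pr.monitor (that script is in-house and not shown in this post). The helper names failed_targets and monitor_report, the second URL, and the canned page bodies are all made up for the example, and the HTTP fetch and login are left out. It assumes the usual mon convention that a monitor prints the failed targets on its first output line and exits non-zero on failure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given a map of URL => page body (undef if the fetch
# failed) and a compiled pattern, return the URLs whose body is missing or
# does not contain the pattern, sorted for stable output.
sub failed_targets {
    my ($bodies, $pattern) = @_;
    my @failed = grep {
        my $body = $bodies->{$_};
        !defined $body || $body !~ $pattern;
    } keys %$bodies;
    @failed = sort @failed;
    return @failed;
}

# Hypothetical helper: build the summary line (space separated failed
# targets) and the exit status (1 if anything failed, otherwise 0).
sub monitor_report {
    my @failed = @_;
    return (join(' ', @failed), @failed ? 1 : 0);
}

# Canned page bodies stand in for real HTTP responses (assumption).
my %bodies = (
    'https://flakeyweb.somedomain.com/login' => '<html>Service Unavailable</html>',
    'https://otherweb.somedomain.com/login'  => '<html>Match String</html>',
);

my ($summary, $status) =
    monitor_report(failed_targets(\%bodies, qr/Match String/));

print "$summary\n";
print "exit status would be $status\n";
```

The summary line is what the alert script later receives on STDIN, which is why exec.action can split it on whitespace to find the failed URLs.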
With a little upalert and alertafter magic, a
successful restart simply sends us an upalert email. But if the restart isn't
successful, MON pages us so we can manually intervene. This process has
really reduced the number of flakeyapp-related pages we receive.
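To make the escalation order concrete, the following Perl sketch simulates which action fires on each consecutive failure, using the thresholds from the configuration in this post (alertafter 2 with numalerts 1 for the restart, alertafter 4 with alertevery 1h for the page, checks every 5 minutes while failing). It is a simplified model written for illustration, not MON's actual scheduling logic, and the simulate function is a hypothetical name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed, simplified model of the two alert periods.
my %p1 = (alertafter => 2, numalerts  => 1);    # restart attempt
my %p2 = (alertafter => 4, alertevery => 60);   # page, at most once per 60 min
my $failure_interval = 5;                       # minutes between checks while failing

# Return a log of the actions triggered over $failures consecutive failures.
sub simulate {
    my ($failures) = @_;
    my @log;
    my $p1_sent = 0;
    my $last_page;    # minute of the last page, undef if none sent yet
    for my $n (1 .. $failures) {
        my $minute = ($n - 1) * $failure_interval;
        # restart once the failure count reaches alertafter, up to numalerts times
        if ($n >= $p1{alertafter} && $p1_sent < $p1{numalerts}) {
            push @log, "failure $n: restart";
            $p1_sent++;
        }
        # page once the failure count reaches alertafter, then at most every alertevery minutes
        if ($n >= $p2{alertafter}
            && (!defined $last_page || $minute - $last_page >= $p2{alertevery})) {
            push @log, "failure $n: page";
            $last_page = $minute;
        }
    }
    return @log;
}

print "$_\n" for simulate(5);
```

Under this model, five consecutive failures produce one restart attempt on the second failure and one page on the fourth. A recovery in between would end the failure run and trigger the upalert instead.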
I've included a snippet of our MON configuration as
well as the simple exec.action alert that we use. NOTE: In our case the
exec.action script depends on a null-passphrase SSH session, but you could
execute any sort of program.
Many thanks to Jim Trocki!
-Lewis
----SNIP----
hostgroup prod_flakeyapp https://flakeyweb.somedomain.com/login

watch prod_flakeyapp
    service flakeyapp
        description flakeyapp in production
        interval 15m
        failure_interval 5m
        monitor http-pr.monitor -h -p id=username,pwd=somepass -r 'qr/Match String/'
        exclude_period _SCHEDULED_MAINT_
        period P1: _ANYTIME_
            # this alert attempts to restart the flakeyapp after a second
            # consecutive failure
            alertafter 2
            # try to restart once, prevents two restart conflict
            numalerts 1
            alert exec.action -U '/usr/local/bin/ssh appuser@%HOST% "/etc/init.d/flakeyapp stop;sleep 1;/etc/init.d/flakeyapp start"'
            upalert ops.alert -u _EMAIL_ _APPOWNER_EMAIL_
        period P2: _ANYTIME_
            # this alert occurs if the restart failed
            alertafter 4
            alertevery 1h
            alert page.alert _PAGER_
            alert ops.alert _EMAIL_
----SNIP----
http-pr.monitor is an in-house script that supports
SSL and password-protected webpages. Anything with a _ prepended and appended is an
m4-processed variable.
exec.action is below:
----SNIP----
#!/usr/bin/perl
#*===========================================================================*
#
# CVS: "$Id: exec.action,v 1.1 2002/02/02 02:21:35 leb Exp $"
# Name: $Name$
#
# BY:
#   Lewis Burgess - (28-Jan-02)
#
# DESCRIPTION:
#   This script is a generic way to perform some action.
#
#*===========================================================================*

#------------------------------------------------------------------------------
# Module includes
#------------------------------------------------------------------------------
require 5.6.1;
use strict;
use Getopt::Std;

#------------------------------------------------------------------------------
# Global Variables
#------------------------------------------------------------------------------
my $VERSION = do { my @r=(q$Revision: 1.1 $=~/\d+/g); sprintf "%d."."%02d"x$#r,@r };
my $ALERT;
my $GROUP;
my $MEMBERS;
my $NEXTALERT;
my $SERVICE;
my $SUMMARY;
my $TIME;
my $TRAP;
my $TRAPTIMEOUT;
my $URL;

#------------------------------------------------------------------------------
# Read command line options/macros. Overrides the above pre-defined macros.
#------------------------------------------------------------------------------
&usage unless getopts('OTUg:h:l:s:t:uv');
$TRAPTIMEOUT = $main::opt_O if $main::opt_O;
$TRAP=1 if $main::opt_T;
$URL=1 if $main::opt_U;
$GROUP = $main::opt_g if $main::opt_g;
$MEMBERS = $main::opt_h if $main::opt_h;
$NEXTALERT = $main::opt_l if $main::opt_l;
$SERVICE = $main::opt_s if $main::opt_s;
$TIME = localtime($main::opt_t) if $main::opt_t;
$ALERT = ($main::opt_u ? "UPALERT" : "ALERT");
&version if $main::opt_v;

#-----------------------------------------------------------------------------
# Functions
#-----------------------------------------------------------------------------
sub usage
{
    print STDERR "Usage: ${0} [-O] [-T] [-U] [-g <group>] [-h <members>] [-l <sec>] [-s <service>] [-t <time>] [-u] [-v] <arguments>\n";
    print STDERR "Options:\n";
    print STDERR " -O triggered by expect trap timeout\n";
    print STDERR " -T alert triggered by a trap\n";
    print STDERR " -U targets are URL(s) (instead of hosts)\n";
    print STDERR " -g <group> host group (from config file)\n";
    print STDERR " -h <members> list of space delimited hosts\n";
    print STDERR " -l <secs> seconds until next alert sent\n";
    print STDERR " -s <service> host service (from config file)\n";
    print STDERR " -t <time> time (time(2) format) when failure detected\n";
    print STDERR " -u true if an upalert\n";
    print STDERR " -v program version\n";
    print STDERR "\n";
    print STDERR "ENVIRONMENT VARIABLES\n";
    print STDERR " MON_LAST_SUMMARY: first line of the output from the last time the monitor exited\n";
    print STDERR " MON_LAST_OUTPUT: entire output of the monitor from the last time it exited\n";
    print STDERR " MON_LAST_FAILURE: time(2) of the last failure for this service\n";
    print STDERR " MON_FIRST_FAILURE: time(2) of the first time this service failed\n";
    print STDERR " MON_LAST_SUCCESS: time(2) of the last time this service passed\n";
    print STDERR " MON_DESCRIPTION: description of this service, as defined in the configuration file using the description tag\n";
    print STDERR " MON_GROUP: watch group which triggered this alarm\n";
    print STDERR " MON_SERVICE: service heading which generated this alert\n";
    print STDERR " MON_RETVAL: exit value of the failed monitor program, or return value as accepted from a trap\n";
    print STDERR " MON_OPSTATUS: operational status of the service\n";
    print STDERR " MON_ALERTTYPE: one of the following values: 'failure', 'up', 'startup', 'trap', or 'traptimeout', and signifies the type of alert which was triggered\n";
    print STDERR " MON_TRAP_INTENDED: set when an unknown mon trap is received and caught by the default/default watch/service. This contains colon separated entries of the trap's intended watch group and service name\n";
    print STDERR " MON_LOGDIR: directory log files should be placed, as indicated by the logdir global configuration variable\n";
    print STDERR " MON_STATEDIR: directory where state files should be kept, as indicated by the statedir global configuration variable\n";
    print STDERR "STDIN:\n";
    print STDERR " First line is the summary\n";
    print STDERR " All other lines are the details\n";
    print STDERR "PARSING:\n";
    print STDERR " %HOST% is translated to the failed host. (-U allows failed URLs to be parsed for the host value)\n";
    print STDERR " %URL% is translated to the failed URL (if -U)\n";
    print STDERR "\n";
    exit(0);
} # usage

#-----------------------------------------------------------------------------
sub version
{
    print q$Id: exec.action,v 1.1 2002/02/02 02:21:35 leb Exp $ . "\n";
    print q$Name$ . "\n";
    exit(0);
} # version

#-----------------------------------------------------------------------------
# Begin
#-----------------------------------------------------------------------------
my $template = $ARGV[0];
my $result = 0;

# the first line of STDIN is the summary: the space delimited failed targets
chomp($SUMMARY = <STDIN>);

foreach my $host (split(' ', $SUMMARY))
{
    my $command;
    $command = $template;
    $command =~ s/%URL%/$host/g if $URL;

    # parse out the hostname if it is a URL
    ($host) = $host =~ m|\w+://([^/:]+)| if $URL;
    $command =~ s/%HOST%/$host/g;

    # execute the command
    `$command`;

    # $? is a raw wait status (exit code << 8), which exit() would truncate
    # to 0, so keep the first failure as an 8 bit exit code instead
    $result ||= ($? == 0 ? 0 : ($? >> 8 & 0xff) || 1);
}
exit $result;
----SNIP----
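As a worked example of the %HOST% and %URL% parsing, the sketch below applies the same substitutions as exec.action to the alert template and URL from the configuration above, but returns the resulting command instead of running it. The wrapper name expand_template is hypothetical and is not part of exec.action.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution steps used in exec.action.
# $is_url plays the role of the -U option.
sub expand_template {
    my ($template, $target, $is_url) = @_;
    my $command = $template;
    my $host    = $target;
    if ($is_url) {
        $command =~ s/%URL%/$target/g;
        # keep only the host part of scheme://host[:port]/path
        ($host) = $target =~ m|\w+://([^/:]+)|;
    }
    $command =~ s/%HOST%/$host/g;
    return $command;
}

my $template = '/usr/local/bin/ssh appuser@%HOST% '
             . '"/etc/init.d/flakeyapp stop;sleep 1;/etc/init.d/flakeyapp start"';
my $command  = expand_template($template,
                               'https://flakeyweb.somedomain.com/login', 1);

# Print the command rather than executing it.
print "$command\n";
```

With -U set, the failed URL https://flakeyweb.somedomain.com/login is reduced to the host flakeyweb.somedomain.com, so the command becomes an ssh to appuser@flakeyweb.somedomain.com that stops and then starts flakeyapp.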
_______________________________________________
mon mailing list
http://linux.kernel.org/mailman/listinfo/mon