We use MON to restart a flakey application that one of our webservers depends on.  When that application hangs, webpages are no longer served properly.
 
The monitoring process goes something like this.  MON checks the password-protected website and searches for a string that confirms the page is being served properly.  If that string is missing, MON runs the related "alert", which is really just a script that uses SSH to restart the application.
 
With a little upalert and alertafter magic, a successful restart simply sends us an upalert email.  But if the restart fails, MON pages us so we can intervene manually.  This process has really reduced the number of flakeyapp-related pages we receive.
 
I've included a snippet of our MON configuration as well as the simple exec.action alert that we use.  NOTE: In our case the exec.action script depends on an SSH login with a null-passphrase key, but you could execute any sort of program.
 
Many thanks to Jim Trocki!
 
-Lewis
 
 
----SNIP----
 
watch prod_flakeyapp
      service flakeyapp
       description flakeyapp in production
       interval 15m
       failure_interval 5m
       monitor http-pr.monitor -h -p id=username,pwd=somepass -r 'qr/Match String/'
       exclude_period _SCHEDULED_MAINT_
       period P1: _ANYTIME_
       # this alert attempts to restart the flakeyapp after a second
       # consecutive failure
       alertafter 2
       # try to restart only once; prevents two restarts from conflicting
       numalerts 1
       alert exec.action -U '/usr/local/bin/ssh appuser@%HOST% "/etc/init.d/flakeyapp stop;sleep 1;/etc/init.d/flakeyapp start"'
       upalert ops.alert -u _EMAIL_ _APPOWNER_EMAIL_
       period P2: _ANYTIME_
       # this alert occurs if the restart failed
       alertafter 4
       alertevery 1h
       alert page.alert _PAGER_
       alert ops.alert _EMAIL_
----SNIP----
 
http-pr.monitor is an in-house script that supports SSL and password-protected web pages.  Anything with a _ prepended and appended is an m4-processed variable.
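
To make the escalation in the two periods easier to follow, here is a minimal sketch of which alerts fire after each consecutive failed check. It is a simplified model, not MON's real scheduler: the function name actions_for_failures is made up for illustration, and it ignores intervals and alertevery beyond the first page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical, simplified model of periods P1 and P2 from the config
# above. Returns one log line per consecutive failed check.
sub actions_for_failures {
    my ($failures) = @_;
    my @log;
    my $restarts = 0;

    for my $n (1 .. $failures) {
        my @fired;

        # P1: alertafter 2 + numalerts 1 -> a single restart attempt
        if ($n >= 2 && $restarts < 1) {
            push @fired, 'restart';
            $restarts++;
        }

        # P2: alertafter 4 -> page and email (repeat pages governed by
        # alertevery are not modelled here)
        push @fired, 'page', 'email' if $n == 4;

        push @log, "failure $n: " . (@fired ? join('+', @fired) : 'no alert');
    }
    return @log;
}

print "$_\n" for actions_for_failures(4);
```

Under these assumptions the restart is attempted on the second failure and the page only goes out if the service is still failing on the fourth.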
 
exec.action is below:
 
----SNIP----
#!/usr/bin/perl
#*===========================================================================*
#
# CVS: "$Id: exec.action,v 1.1 2002/02/02 02:21:35 leb Exp $"
# Name: $Name$
#
# BY:
#     Lewis Burgess - (28-Jan-02)
#
# DESCRIPTION:
# This script is a generic way to perform some action.
#
#*===========================================================================*
 
#------------------------------------------------------------------------------
# Module includes
#------------------------------------------------------------------------------
 
require 5.6.1;
 
use strict;
 
use Getopt::Std;
 
#------------------------------------------------------------------------------
# Global Variables
#------------------------------------------------------------------------------
 
my $VERSION = do { my @r=(q$Revision: 1.1 $=~/\d+/g); sprintf "%d."."%02d"x$#r,@r };
 
my $ALERT;
my $GROUP;
my $MEMBERS;
my $NEXTALERT;
my $SERVICE;
my $SUMMARY;
my $TIME;
my $TRAP;
my $TRAPTIMEOUT;
my $URL;
 
#------------------------------------------------------------------------------
# Read command line options/macros.  Overrides the above pre-defined macros.
#------------------------------------------------------------------------------
 
&usage unless getopts('OTUg:h:l:s:t:uv');
 
$TRAPTIMEOUT = $main::opt_O if $main::opt_O;
$TRAP=1 if $main::opt_T;
$URL=1 if $main::opt_U;
 
$GROUP = $main::opt_g if $main::opt_g;
$MEMBERS = $main::opt_h if $main::opt_h;
$NEXTALERT = $main::opt_l if $main::opt_l;
$SERVICE = $main::opt_s if $main::opt_s;
$TIME = localtime($main::opt_t) if $main::opt_t;
$ALERT = ($main::opt_u ? "UPALERT" : "ALERT");
 
&version if $main::opt_v;
 
#-----------------------------------------------------------------------------
# Functions
#-----------------------------------------------------------------------------
 
sub usage
{
  print STDERR "Usage: ${0} [-O] [-T] [-U] [-g <group>] [-h <members>] [-l <secs>] [-s <service>] [-t <time>] [-u] [-v] <arguments>\n";
  print STDERR "Options:\n";
  print STDERR "  -O                   triggered by expect trap timeout\n";
  print STDERR "  -T                   alert triggered by a trap\n";
  print STDERR "  -U                   targets are URL(s) (instead of hosts)\n";
  print STDERR "  -g <group>           host group (from config file)\n";
  print STDERR "  -h <members>         list of space delimited hosts\n";
  print STDERR "  -l <secs>            seconds until next alert sent\n";
  print STDERR "  -s <service>         host service (from config file)\n";
  print STDERR "  -t <time>            time (time(2) format) when failure detected\n";
  print STDERR "  -u                   true if an upalert\n";
  print STDERR "  -v                   program version\n";
  print STDERR "\n";
  print STDERR "ENVIRONMENT VARIABLES\n";
  print STDERR "  MON_LAST_SUMMARY: first line of the output from the last time the monitor exited\n";
  print STDERR "  MON_LAST_OUTPUT: entire output of the monitor from the last time it exited\n";
  print STDERR "  MON_LAST_FAILURE: time(2) of the last failure for this service\n";
  print STDERR "  MON_FIRST_FAILURE: time(2) of the first time this service failed\n";
  print STDERR "  MON_LAST_SUCCESS: time(2) of the last time this service passed\n";
  print STDERR "  MON_DESCRIPTION: description of this service, as defined in the configuration file using the description tag\n";
  print STDERR "  MON_GROUP: watch group which triggered this alarm\n";
  print STDERR "  MON_SERVICE: service heading which generated this alert\n";
  print STDERR "  MON_RETVAL: exit value of the failed monitor program, or return value as accepted from a trap\n";
  print STDERR "  MON_OPSTATUS: operational status of the service\n";
  print STDERR "  MON_ALERTTYPE: one of the following values: 'failure', 'up', 'startup', 'trap', or 'traptimeout', and signifies the type of alert which was triggered\n";
  print STDERR "  MON_TRAP_INTENDED: set when an unknown mon trap is received and caught by the default/default watch/service. This contains colon separated entries of the trap's intended watch group and service name\n";
  print STDERR "  MON_LOGDIR: directory where log files should be placed, as indicated by the logdir global configuration variable\n";
  print STDERR "  MON_STATEDIR: directory where state files should be kept, as indicated by the statedir global configuration variable\n";
  print STDERR "STDIN:\n";
  print STDERR "  First line is the summary\n";
  print STDERR "  All other lines are the details\n";
  print STDERR "PARSING:\n";
  print STDERR "  %HOST% is translated to the failed host (-U allows failed URLs to be parsed for the host value)\n";
  print STDERR "  %URL% is translated to the failed URL (if -U)\n";
  print STDERR "\n";
 
  exit(0);
}    # usage
#-----------------------------------------------------------------------------
sub version
{
  print q$Id: exec.action,v 1.1 2002/02/02 02:21:35 leb Exp $ . "\n";
  print q$Name$ . "\n";
  exit(0);
}    # version
 
#-----------------------------------------------------------------------------
# Begin
#-----------------------------------------------------------------------------
 
my $template = $ARGV[0];
my $result;
 
chomp($SUMMARY = <STDIN>);
 
foreach my $host (split(' ', $SUMMARY))
{
  my $command;
 
  $command = $template;
 
  $command =~ s/%URL%/$host/g if $URL;
 
  # parse out the hostname if it is a URL
  ($host) = $host =~ m|\w+://([^/:]+)| if $URL;
 
  $command =~ s/%HOST%/$host/g;
 
  # execute the command
  `$command`;
 
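  # $? holds the raw wait status; shift right by 8 to get the command's
  # exit code, otherwise a status such as 256 would wrap to exit code 0
  $result ||= $? >> 8;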
}
 
exit $result;
----SNIP----
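
As a rough illustration of the %HOST% and %URL% parsing, the sketch below pulls the substitution out into a standalone function and prints the expanded commands instead of running them. The name expand_template, the example.com URLs, and the echo template are all made up for this example; only the two substitutions and the URL regex mirror the loop in exec.action.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of exec.action): expand a command
# template for one failed target, mirroring the loop in exec.action.
sub expand_template {
    my ($template, $target, $is_url) = @_;
    my $command = $template;
    my $host    = $target;

    if ($is_url) {
        # %URL% receives the whole failed URL.
        $command =~ s/%URL%/$target/g;

        # Reduce the URL to its host part for %HOST%.
        ($host) = $target =~ m|\w+://([^/:]+)|;
        $host = '' unless defined $host;
    }

    $command =~ s/%HOST%/$host/g;
    return $command;
}

# Made-up summary line, as a monitor might report failed URLs.
my $summary  = 'https://www1.example.com/login http://www2.example.com:8080/';
my $template = 'echo "would restart flakeyapp on %HOST% (failed: %URL%)"';

# Print each expanded command rather than executing it.
print expand_template($template, $_, 1), "\n" for split ' ', $summary;
```

Printing the command first is a cheap way to check the quoting of a new alert line before letting MON execute it.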