Ori.livneh has uploaded a new change for review. https://gerrit.wikimedia.org/r/88009
Change subject: Add Icinga check for l10nupdate & drop !log-based alerts
......................................................................

Add Icinga check for l10nupdate & drop !log-based alerts

Localisation update is the only job that uses !log to issue alerts, to
my knowledge, and it does so indiscriminately, spamming the server admin
log with both successes and failures. It's not clear to me that anyone
takes its failures very seriously, either. (The last run failed, for
example.)

This patch adds an Icinga plug-in that checks the status of the
localisation cache. It issues a WARNING if the caches are more than 26
hours old, and a CRITICAL if they are more than 50 hours old. The
numbers were chosen because l10nupdate runs once a day and takes some
arbitrary fraction of an hour to complete, and because failures are (in
my anecdotal experience) typically given another shot to self-correct
before being debugged.

With the Icinga check in effect, it would not be necessary (or
appropriate) for l10nupdate to use the SAL, so this patch also updates
l10nupdate-1 to echo verbose log messages to standard out instead. (The
l10nupdate cron job redirects stdout to a log file.)

Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
---
A files/icinga/check_l10n_cache
M files/misc/l10nupdate/l10nupdate-1
M manifests/misc/deployment.pp
M templates/icinga/checkcommands.cfg.erb
4 files changed, 76 insertions(+), 12 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/09/88009/1

diff --git a/files/icinga/check_l10n_cache b/files/icinga/check_l10n_cache
new file mode 100755
index 0000000..28dc285
--- /dev/null
+++ b/files/icinga/check_l10n_cache
@@ -0,0 +1,53 @@
+#!/bin/bash
+#
+# Icinga plug-in for Wikimedia's MediaWiki localisation cache.
+#
+# This check should run on the host that runs l10nupdate. It ensures
+# that an up-to-date localisation cache directory exists for each
+# deployed version of MediaWiki. The check will report
+#
+#   OK       - If localisation caches are up-to-date.
+#   WARNING  - If the l10n cache files are more than 26 hours old.
+#   CRITICAL - If the l10n cache files are more than 50 hours old.
+#   UNKNOWN  - If no MediaWiki versions are in use.
+#   UNKNOWN  - If no l10n cache directory exists for a version.
+#
+. /usr/local/lib/mw-deployment-vars.sh
+. $MW_COMMON_SOURCE/multiversion/MWRealm.sh
+
+versions=($(/usr/local/bin/mwversionsinuse))
+
+if [ -z "$versions" ]; then
+    echo "UNKNOWN: mwversionsinuse returned an empty list"
+    exit 3
+fi
+
+missing=()
+critical=()
+warning=()
+
+for version in "${versions[@]}"; do
+    l10n_dir="${MW_COMMON_SOURCE}/php-${version}/cache/l10n"
+    if [ ! -d "$l10n_dir" ]; then
+        missing+=("$version")
+    elif [ ! `find "${l10n_dir}" -mtime -2.1 -print -quit` ]; then
+        critical+=("$version")
+    elif [ ! `find "${l10n_dir}" -mtime -1.1 -print -quit` ]; then
+        warning+=("$version")
+    fi
+done
+
+IFS=,
+if [ -n "$critical" ]; then
+    echo "CRITICAL: localisation cache is more than 50 hours old -- ${critical[*]}"
+    exit 2
+elif [ -n "$warning" ]; then
+    echo "WARNING: localisation cache is more than 26 hours old -- ${warning[*]}"
+    exit 1
+elif [ -n "$missing" ]; then
+    echo "UNKNOWN: localisation cache directory is missing -- ${missing[*]}"
+    exit 3
+fi
+
+echo "OK: localisation caches are up-to-date."
+exit 0
diff --git a/files/misc/l10nupdate/l10nupdate-1 b/files/misc/l10nupdate/l10nupdate-1
index 95c939b..2328b8f 100755
--- a/files/misc/l10nupdate/l10nupdate-1
+++ b/files/misc/l10nupdate/l10nupdate-1
@@ -23,8 +23,7 @@
         then
             echo "Updated $path"
         else
-            $BINDIR/dologmsg "!log LocalisationUpdate failed: git pull of $path failed"
-            echo "Updating $path FAILED."
+            echo "LocalisationUpdate failed: git pull of $path failed"
             exit 1
         fi
     else
@@ -36,8 +35,7 @@
         then
             echo "Cloned $path"
         else
-            $BINDIR/dologmsg "!log LocalisationUpdate failed: git clone of $path failed"
-            echo "Cloning $path FAILED."
+            echo "LocalisationUpdate failed: git clone of $path failed"
             exit 1
         fi
     fi
@@ -47,8 +45,7 @@
 # Get all MW message cache versions (and a wiki DB name for each)
 mwVerDbSets=$($BINDIR/mwversionsinuse --extended --withdb)
 if [ -z "$mwVerDbSets" ]; then
-    $BINDIR/dologmsg "!log LocalisationUpdate failed: mwversionsinuse returned empty list"
-    echo "Obtaining MediaWiki version list FAILED"
+    echo "LocalisationUpdate failed: mwversionsinuse returned empty list"
     exit 1
 fi
 
@@ -79,11 +76,9 @@
         cp --preserve=timestamps --force /var/lib/l10nupdate/cache-"$mwVerNum"/l10n_cache-* $MW_COMMON_SOURCE/php-"$mwVerNum"/cache/l10n
         echo "Syncing to Apaches"
         $BINDIR/sync-l10nupdate-1 "$mwVerNum"
-        $BINDIR/dologmsg "!log LocalisationUpdate completed ($mwVerNum) at `date`"
-        echo "All done"
+        echo "LocalisationUpdate completed ($mwVerNum) at `date`"
     else
-        $BINDIR/dologmsg "!log LocalisationUpdate failed ($mwVerNum) at `date`"
-        echo "FAILED"
+        echo "LocalisationUpdate failed ($mwVerNum) at `date`"
     fi
 done
 
@@ -93,5 +88,4 @@
 for wiki in `<"$ALLDB"`; do
     /usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php --wiki="$wiki"
 done
-echo "All done"
-$BINDIR/dologmsg "!log LocalisationUpdate ResourceLoader cache refresh completed at `date`"
+echo "LocalisationUpdate ResourceLoader cache refresh completed at `date`"
diff --git a/manifests/misc/deployment.pp b/manifests/misc/deployment.pp
index 04f71ae..7742062 100644
--- a/manifests/misc/deployment.pp
+++ b/manifests/misc/deployment.pp
@@ -301,6 +301,19 @@
             ensure => present;
     }
 
+    file { '/usr/lib/nagios/plugins/check_l10n_cache':
+        source => 'puppet:///files/icinga/check_l10n_cache',
+        mode   => '0755',
+    }
+
+    nrpe::monitor_service { 'l10nupdate':
+        ensure        => 'present',
+        description   => 'Ensure localisation caches are up-to-date',
+        nrpe_command  => '/usr/lib/nagios/plugins/check_l10n_cache',
+        require       => File['/usr/lib/nagios/plugins/check_l10n_cache'],
+        contact_group => 'admins',
+    }
+
     file { "${scriptpath}/l10nupdate":
         owner => root,
diff --git a/templates/icinga/checkcommands.cfg.erb b/templates/icinga/checkcommands.cfg.erb
index 259022c..47c3c84 100644
--- a/templates/icinga/checkcommands.cfg.erb
+++ b/templates/icinga/checkcommands.cfg.erb
@@ -418,6 +418,10 @@
     command_name    check_eventlogging_jobs
     command_line    /usr/lib/nagios/plugins/check_eventlogging_jobs
 }
+define command{
+    command_name    check_l10n_cache
+    command_line    /usr/lib/nagios/plugins/check_l10n_cache
+}
 
 #Generic NRPE check

-- 
To view, visit https://gerrit.wikimedia.org/r/88009
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ori.livneh <o...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
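
The threshold rule stated in the commit message (WARNING past 26 hours,
CRITICAL past 50 hours) can be sketched as a small shell function. This
is only an illustration and is not part of the patch: the function name
classify_l10n_age and the whole-hour input are assumptions made here.
The check in the change compares file modification times with
find -mtime (-1.1 and -2.1 days, roughly 26.4 and 50.4 hours) rather
than taking an age as an argument.

```shell
#!/bin/bash
# Hypothetical helper (not in the patch): map a cache age in whole hours
# to a Nagios/Icinga status name and exit code.
WARN_HOURS=26   # WARNING threshold from the commit message
CRIT_HOURS=50   # CRITICAL threshold from the commit message

classify_l10n_age() {
    local age_hours=$1

    # Anything that is not a non-negative integer cannot be classified.
    if ! [[ $age_hours =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN"
        return 3
    fi

    # 10# forces base 10 so that inputs such as "08" are not read as octal.
    if (( 10#$age_hours > CRIT_HOURS )); then
        echo "CRITICAL"
        return 2
    elif (( 10#$age_hours > WARN_HOURS )); then
        echo "WARNING"
        return 1
    fi

    echo "OK"
    return 0
}
```

For example, classify_l10n_age 30 prints WARNING and returns 1, matching
the exit codes the plug-in uses (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).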