Ori.livneh has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/88009


Change subject: Add Icinga check for l10nupdate & drop !log-based alerts
......................................................................

Add Icinga check for l10nupdate & drop !log-based alerts

Localisation update is the only job that uses !log to issue alerts, to my
knowledge, and it does so indiscriminately, spamming the server admin log with
both successes and failures. It's not clear to me that anyone takes its
failures very seriously, either. (The last run failed, for example.)

This patch adds an Icinga plug-in that checks the status of the localisation
cache. It issues a WARNING if the caches are more than 26 hours old, and a
CRITICAL if more than 50. The numbers were chosen because l10nupdate runs once
a day and takes
some arbitrary fraction of an hour to complete, and because failures are (in my
anecdotal experience) typically given another shot to self-correct before being
debugged.
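
As an illustration of the thresholds described above, the mapping from cache
age to Icinga state can be sketched as a small shell function. The name
classify_age and the whole-hour input are assumptions made for this example.
The plug-in in this patch does not compute an age directly; it probes for
recently modified files with find -mtime, using 1.1 and 2.1 days (about 26.4
and 50.4 hours).

```shell
#!/bin/bash
# Hypothetical helper, not part of this patch: map a cache age in whole
# hours to the Icinga state described in the commit message.
classify_age() {
    local age_hours=$1
    if (( age_hours > 50 )); then
        echo "CRITICAL"
    elif (( age_hours > 26 )); then
        echo "WARNING"
    else
        echo "OK"
    fi
}

classify_age 3     # a run that finished a few hours ago
classify_age 30    # one missed daily run
classify_age 72    # two or more missed runs
```

Under these assumptions, a single missed daily run only warns, and the check
goes critical once a second run has also failed to refresh the cache.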

With the Icinga check in effect, it would not be necessary (or appropriate) for
l10nupdate to use the SAL, so this patch also updates l10nupdate-1 to echo
verbose log messages to standard out instead. (The l10nupdate cron job
redirects stdout to a log file.)
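
A rough sketch of what this means in practice: a message echoed by the script
lands in whatever file the caller redirects stdout to, and the exit status
still signals the failure. The function fake_l10nupdate, the path in its
message, and the temporary log file are all hypothetical stand-ins; the real
cron entry and log path are not shown in this change.

```shell
#!/bin/bash
# Hypothetical stand-in for l10nupdate-1: report a failure on stdout and
# return non-zero, as the patched script does when a git pull fails.
fake_l10nupdate() {
    echo "LocalisationUpdate failed: git pull of /hypothetical/path failed"
    return 1
}

# Hypothetical log location, used only for this demonstration.
log_file=$(mktemp)

# Append stdout to the log, as a cron entry ending in ">> logfile" would.
status=0
fake_l10nupdate >> "$log_file" || status=$?

echo "exit status: $status"
cat "$log_file"
```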

Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
---
A files/icinga/check_l10n_cache
M files/misc/l10nupdate/l10nupdate-1
M manifests/misc/deployment.pp
M templates/icinga/checkcommands.cfg.erb
4 files changed, 76 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/09/88009/1

diff --git a/files/icinga/check_l10n_cache b/files/icinga/check_l10n_cache
new file mode 100755
index 0000000..28dc285
--- /dev/null
+++ b/files/icinga/check_l10n_cache
@@ -0,0 +1,53 @@
+#!/bin/bash
+#
+# Icinga plug-in for Wikimedia's MediaWiki localisation cache.
+#
+# This check should run on the host that runs l10nupdate. It ensures
+# that an up-to-date localisation cache directory exists for each
+# deployed version of MediaWiki. The check will report
+#
+#  OK       - If localisation caches are up-to-date.
+#  WARNING  - If the l10n cache files are more than 26 hours old.
+#  CRITICAL - If the l10n cache files are more than 50 hours old.
+#  UNKNOWN  - If no MediaWiki versions are in use.
+#  UNKNOWN  - If no l10n cache directory exists for a version.
+#
+. /usr/local/lib/mw-deployment-vars.sh
+. $MW_COMMON_SOURCE/multiversion/MWRealm.sh
+
+versions=($(/usr/local/bin/mwversionsinuse))
+
+if [ -z "$versions" ]; then
+    echo "UNKNOWN: mwversionsinuse returned an empty list"
+    exit 3
+fi
+
+missing=()
+critical=()
+warning=()
+
+for version in "${versions[@]}"; do
+    l10n_dir="${MW_COMMON_SOURCE}/php-${version}/cache/l10n"
+    if [ ! -d "$l10n_dir" ]; then
+        missing+=("$version")
+    elif [ ! `find "${l10n_dir}" -mtime -2.1 -print -quit` ]; then
+        critical+=("$version")
+    elif [ ! `find "${l10n_dir}" -mtime -1.1 -print -quit` ]; then
+        warning+=("$version")
+    fi
+done
+
+IFS=,
+if [ -n "$critical" ]; then
+    echo "CRITICAL: localisation cache is more than 50 hours old -- ${critical[*]}"
+    exit 2
+elif [ -n "$warning" ]; then
+    echo "WARNING: localisation cache is more than 26 hours old -- ${warning[*]}"
+    exit 1
+elif [ -n "$missing" ]; then
+    echo "UNKNOWN: localisation cache directory is missing -- ${missing[*]}"
+    exit 3
+fi
+
+echo "OK: localisation caches are up-to-date."
+exit 0
diff --git a/files/misc/l10nupdate/l10nupdate-1 b/files/misc/l10nupdate/l10nupdate-1
index 95c939b..2328b8f 100755
--- a/files/misc/l10nupdate/l10nupdate-1
+++ b/files/misc/l10nupdate/l10nupdate-1
@@ -23,8 +23,7 @@
                then
                        echo "Updated $path"
                else
-                       $BINDIR/dologmsg "!log LocalisationUpdate failed: git pull of $path failed"
-                       echo "Updating $path FAILED."
+                       echo "LocalisationUpdate failed: git pull of $path failed"
                        exit 1
                fi
        else
@@ -36,8 +35,7 @@
                then
                        echo "Cloned $path"
                else
-                       $BINDIR/dologmsg "!log LocalisationUpdate failed: git clone of $path failed"
-                       echo "Cloning $path FAILED."
+                       echo "LocalisationUpdate failed: git clone of $path failed"
                        exit 1
                fi
        fi
@@ -47,8 +45,7 @@
 # Get all MW message cache versions (and a wiki DB name for each)
 mwVerDbSets=$($BINDIR/mwversionsinuse --extended --withdb)
 if [ -z "$mwVerDbSets" ]; then
-       $BINDIR/dologmsg "!log LocalisationUpdate failed: mwversionsinuse returned empty list"
-       echo "Obtaining MediaWiki version list FAILED"
+       echo "LocalisationUpdate failed: mwversionsinuse returned empty list"
        exit 1
 fi
 
@@ -79,11 +76,9 @@
                cp --preserve=timestamps --force /var/lib/l10nupdate/cache-"$mwVerNum"/l10n_cache-* $MW_COMMON_SOURCE/php-"$mwVerNum"/cache/l10n
                echo "Syncing to Apaches"
                $BINDIR/sync-l10nupdate-1 "$mwVerNum"
-               $BINDIR/dologmsg "!log LocalisationUpdate completed ($mwVerNum) at `date`"
-               echo "All done"
+               echo "LocalisationUpdate completed ($mwVerNum) at `date`"
        else
-               $BINDIR/dologmsg "!log LocalisationUpdate failed ($mwVerNum) at `date`"
-               echo "FAILED"
+               echo "LocalisationUpdate failed ($mwVerNum) at `date`"
        fi
 done
 
@@ -93,5 +88,4 @@
 for wiki in `<"$ALLDB"`; do
        /usr/local/bin/mwscript extensions/WikimediaMaintenance/refreshMessageBlobs.php --wiki="$wiki"
 done
-echo "All done"
-$BINDIR/dologmsg "!log LocalisationUpdate ResourceLoader cache refresh completed at `date`"
+echo "LocalisationUpdate ResourceLoader cache refresh completed at `date`"
diff --git a/manifests/misc/deployment.pp b/manifests/misc/deployment.pp
index 04f71ae..7742062 100644
--- a/manifests/misc/deployment.pp
+++ b/manifests/misc/deployment.pp
@@ -301,6 +301,19 @@
                ensure => present;
        }
 
+    file { '/usr/lib/nagios/plugins/check_l10n_cache':
+        source => 'puppet:///files/icinga/check_l10n_cache',
+        mode   => '0755',
+    }
+
+    nrpe::monitor_service { 'l10nupdate':
+        ensure        => 'present',
+        description   => 'Ensure localisation caches are up-to-date',
+        nrpe_command  => '/usr/lib/nagios/plugins/check_l10n_cache',
+        require       => File['/usr/lib/nagios/plugins/check_l10n_cache'],
+        contact_group => 'admins',
+    }
+
        file {
                "${scriptpath}/l10nupdate":
                        owner => root,
diff --git a/templates/icinga/checkcommands.cfg.erb b/templates/icinga/checkcommands.cfg.erb
index 259022c..47c3c84 100644
--- a/templates/icinga/checkcommands.cfg.erb
+++ b/templates/icinga/checkcommands.cfg.erb
@@ -418,6 +418,10 @@
        command_name    check_eventlogging_jobs
        command_line    /usr/lib/nagios/plugins/check_eventlogging_jobs
 }
+define command{
+        command_name   check_l10n_cache
+        command_line   /usr/lib/nagios/plugins/check_l10n_cache
+}
 
 
 #Generic NRPE check

-- 
To view, visit https://gerrit.wikimedia.org/r/88009
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6eca6c063319a2663bb25d76a92709103c9dd88a
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Ori.livneh <o...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
