Giuseppe Lavagetto has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/357193 )

Change subject: role:jobqueue_redis: daily restart of slaves
......................................................................

role:jobqueue_redis: daily restart of slaves

As detailed in T163337, the jobqueue redis replication is essentially
broken, and objects on the master and slave servers will diverge
dramatically over time. As a first step, restart all slave instances
daily to force a full resync and limit the divergence to a 1-day
interval.

This is by no means a perfect solution, but it is meant to limit the
damage in case we suddenly lose the main DC and have to switch to the
other one without the time or possibility to do a full resync.

Change-Id: I5e432cdc46e1dfdae9d007f1bb7324810418fa08
---
A modules/profile/files/redis/restart-redis-if-slave.sh
A modules/profile/manifests/redis/jobqueue.pp
M modules/role/manifests/jobqueue_redis/master.pp
3 files changed, 35 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/operations/puppet refs/changes/93/357193/1

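The script in the patch below restarts an instance only when its INFO replication output contains role:slave. As a rough sketch of that decision, the check can be tried against canned output without a running redis. The is_slave helper and the sample text are hypothetical and not part of this change; unlike the patch, the sketch strips the CRLF line endings redis uses in INFO output and matches the whole line.

```shell
# Hypothetical helper, not part of this change: read the text of
# "INFO replication" on stdin and succeed only if it reports a slave.
is_slave() {
    # INFO lines end in CRLF; drop the \r so grep -x can match the
    # complete line instead of a substring.
    tr -d '\r' | grep -qx 'role:slave'
}

# Canned sample of the relevant INFO fields, for a dry run without redis.
if printf '# Replication\r\nrole:slave\r\n' | is_slave; then
    echo "would restart"
else
    echo "would skip"
fi
```

Matching the whole line is a design choice of this sketch only; the unanchored grep in the patch behaves the same on normal INFO output.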
diff --git a/modules/profile/files/redis/restart-redis-if-slave.sh b/modules/profile/files/redis/restart-redis-if-slave.sh
new file mode 100755
index 0000000..9627146
--- /dev/null
+++ b/modules/profile/files/redis/restart-redis-if-slave.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+set -e
+
+# Check if currently a slave
+for instance in "$@";
+do
+    _config="/etc/redis/tcp_${instance}.conf"
+    authpass=$(awk '{if ($1 == "requirepass") print $2}' "$_config")
+    if redis-cli -h 127.0.0.1 -p "$instance" -a "$authpass" INFO replication | grep -q role:slave; then
+        systemctl restart "redis-instance-tcp_${instance}.service"
+    fi
+done
diff --git a/modules/profile/manifests/redis/jobqueue.pp b/modules/profile/manifests/redis/jobqueue.pp
new file mode 100644
index 0000000..f6cb081
--- /dev/null
+++ b/modules/profile/manifests/redis/jobqueue.pp
@@ -0,0 +1,22 @@
+# Very simple profile for redis for the MW jobqueue. It works as an addition to
+# profile::redis::multidc
+# This is basically to cope with the issues described in
+# https://phabricator.wikimedia.org/T163337 with an ugly workaround:
+# periodically restart the redis slaves in order to force a
+# full resync
+class profile::redis::jobqueue {
+    require ::profile::redis::multidc
+    file { '/usr/local/bin/restart-redis-if-slave':
+        ensure => present,
+        source => 'puppet:///modules/profile/redis/restart-redis-if-slave.sh',
+        mode   => '0555',
+        owner  => 'root',
+        group  => 'root',
+    }
+
+    cron { 'jobqueue-redis-conditional-restart':
+        command => "/usr/local/bin/restart-redis-if-slave ${::profile::redis::multidc::instances}",
+        hour    => 1,
+        minute  => 0,
+    }
+}
diff --git a/modules/role/manifests/jobqueue_redis/master.pp b/modules/role/manifests/jobqueue_redis/master.pp
index aa38b82..95428f0 100644
--- a/modules/role/manifests/jobqueue_redis/master.pp
+++ b/modules/role/manifests/jobqueue_redis/master.pp
@@ -2,6 +2,7 @@
     include ::standard
     include ::base::firewall
     include ::profile::redis::multidc
+    include ::profile::redis::jobqueue
 
     system::role { 'role::jobqueue_redis::master':
         description => 'Jobqueue master',

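The script above reads the instance password from the requirepass line of /etc/redis/tcp_<port>.conf using awk. A minimal sketch of that lookup, run against a temporary file, could look like this. The get_requirepass name, the sample config and the password value are assumptions made for illustration; the sketch also stops at the first match so the result is always a single line, which the patch does not do.

```shell
# Hypothetical helper, not part of this change: print the value of the
# first "requirepass" directive found in the config file given as $1.
get_requirepass() {
    awk '$1 == "requirepass" { print $2; exit }' "$1"
}

# Dry run against a throwaway config file with made-up contents.
conf=$(mktemp)
printf 'port 6379\nrequirepass examplepass\n' > "$conf"
echo "password read from config: $(get_requirepass "$conf")"
rm -f "$conf"
```

If the directive is missing, the function prints nothing; a caller would then have to decide whether to skip the -a option or bail out.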
-- 
To view, visit https://gerrit.wikimedia.org/r/357193
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I5e432cdc46e1dfdae9d007f1bb7324810418fa08
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet
Gerrit-Branch: production
Gerrit-Owner: Giuseppe Lavagetto <[email protected]>
