jcrespo added a comment.
$ check_mariadb.py -h db1052 --slave-status --primary-dc=eqiad
{"datetime": 1501777331.898183, "ssl_expiration": 1619276854.0, "connection": "ok", "connection_latency": 0.07626748085021973, "ssl": true, "total_queries": 15981662418, "heartbeat": {"s1": 0.400536}, "uptime":
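A status document like the one above could be consumed by a small wrapper. The following is only a sketch: the function name `lagging_sections` and the 1-second threshold are hypothetical and not part of check_mariadb.py, and the sample string uses only fields visible in the output above.

```python
import json


def lagging_sections(status_json: str, max_lag_seconds: float = 1.0) -> dict:
    """Return the replication sections whose heartbeat lag exceeds the threshold.

    The "heartbeat" key maps a section name (e.g. "s1") to its lag in seconds,
    as in the output above. The default threshold is an illustrative assumption.
    """
    status = json.loads(status_json)
    heartbeat = status.get("heartbeat", {})
    return {section: lag for section, lag in heartbeat.items() if lag > max_lag_seconds}


# Sample built only from fields shown in the output above.
sample = '{"connection": "ok", "ssl": true, "heartbeat": {"s1": 0.400536}}'
print(lagging_sections(sample))
```

With the default threshold no section is reported for this sample; lowering `max_lag_seconds` below 0.4 would report `s1`.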
gerritbot added a comment.
Change 369397 merged by Jcrespo:
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server
https://gerrit.wikimedia.org/r/369397
TASK DETAIL
https://phabricator.wikimedia.org/T171928
jcrespo added a comment.
Wikidata goes into read-only the subscriptions mentioned
Yes, some extensions have definitely not behaved perfectly in the past and do not respect MediaWiki's read-only mode. I do not know what the state of Wikidata is, but from what you say, a ticket should be filed so its sta
gerritbot added a comment.
Change 369397 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Add new python3 script to check the health of a server
https://gerrit.wikimedia.org/r/369397
jcrespo added a comment.
I have started working on more complete monitoring, useful if we go down the route of human monitoring rather than automation. Here is one example:
$ ./check_mariadb.py --icinga -h db1052.eqiad.wmnet --check_read_only=0
Version 10.0.28-MariaDB, Uptime 16295390s, read_only:
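The decision behind a check like `--check_read_only=0` can be sketched as below. This is an illustration under assumptions, not the code of check_mariadb.py: the function name and message wording are hypothetical, and only the exit codes follow the standard Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Optional, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_read_only(actual: Optional[bool], expected: bool) -> Tuple[int, str]:
    """Compare the observed read_only value with the expected one.

    `actual` is None when the value could not be obtained (for example,
    the connection failed). Returns an (exit_code, message) pair.
    """
    if actual is None:
        return UNKNOWN, "UNKNOWN: could not determine read_only"
    if actual == expected:
        return OK, f"OK: read_only: {actual}, as expected"
    return CRITICAL, f"CRITICAL: read_only: {actual}, expected {expected}"


# Example: an active master is expected to be read-write (read_only off),
# but the server reports read_only on.
code, message = evaluate_read_only(actual=True, expected=False)
print(code, message)
```

A real plugin would print the message and then call `sys.exit(code)` so that Icinga picks up the state; alerting a human rather than flipping the setting automatically is the point of this approach.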
mark added a comment.
I agree; there's a very good reason for setting masters to read-only when something has happened, because it needs manual intervention to investigate whether it's safe to go read-write again. Any automation to do that should be REALLY thoroughly thought through, covering all cases
Marostegui added a comment.
From my side, I would prefer option "b" (monitoring read-only status on the active masters)
My reasoning for this is:
I wouldn't like puppet to automatically change settings, especially on the masters. And if a master crashes, I want to investigate why it crashed (in cas
jcrespo added a comment.
I've almost finished the above incident documentation. However, I am unsure about which are the right actionables and their priorities (last section).
Let's use this ticket to agree on what would be the best follow-up: a) making puppet change the read-only state of the db serv