On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri <amegy...@minerva-soft.com> wrote:
> Dear community,
>
> A few days ago we had an issue in our MySQL M/S replication cluster.
> We have one R/W master and one RO slave. The RO VIP is supposed to run
> on the slave if it is not too far behind the master, and if any error
> occurs, the RO VIP is moved to the master.
>
> Something happened with the slave MySQL (some disk issue, still
> investigating), but the problem is that the RO VIP remained on the slave
> node, even though the slave process was not running and the server was
> far behind the master.
>
> During the issue the following log entries appeared (just an extract,
> as the full log would be too long):
>
> Aug 20 02:04:07 ctdb1 corosync[1056]: [MAIN ] Corosync main process was not scheduled for 14088.5488 ms (threshold is 4000.0000 ms). Consider token timeout increase.
> Aug 20 02:04:07 ctdb1 corosync[1056]: [TOTEM ] A processor failed, forming new configuration.
> Aug 20 02:04:34 ctdb1 corosync[1056]: [MAIN ] Corosync main process was not scheduled for 27065.2559 ms (threshold is 4000.0000 ms). Consider token timeout increase.
> Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6720) was formed. Members left: 168362243 168362281 168362282 168362301 168362302 168362311 168362312 1
> Aug 20 02:04:34 ctdb1 corosync[1056]: [TOTEM ] A new membership (xxx:6724) was formed. Members
> ..
> Aug 20 02:13:28 ctdb1 corosync[1056]: [MAIN ] Completed service synchronization, ready to provide service.
> ..
> Aug 20 02:13:29 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending flush op to all hosts for: readable (1)
> ...
> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification for ctdb1
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, IP_CIP=
> Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address xxx/24 with broadcast address xxxx to device eth0
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device eth0 up
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
> Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
> Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-ip-slave_monitor_20000 (call=380, rc=0, cib-update=181, confirmed=false) ok
> Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
> Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-db-mysql (1)
> Aug 20 02:13:32 ctdb1 attrd[1584]: notice: attrd_perform_update: Sent update 1622: master-db-mysql=1
> Aug 20 02:13:32 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote notification for my own demotion.
> Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on an instance that is not a replication slave.
> Aug 20 02:13:33 ctdb1 crmd[1586]: notice: process_lrm_event: LRM operation db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 xxxx. UDP 123
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123, interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
> Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
> Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
> Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on an instance that is not a replication slave.
> Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on an instance that is not a replication slave.
>
> From this point on, the last two lines repeated every 7 seconds (the mysql monitor interval).
>
> The expected behavior was that the slave (RO) VIP would be moved to the
> master, since the secondary DB was outdated.
> Unfortunately I cannot recall what crm_mon was showing while the issue
> was present, but I am sure that the RA did not handle the situation
> properly.
>
> Placing the slave node into standby and then back online resolved the
> issue immediately (the slave started to sync, and within a few minutes
> it caught up with the master).
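For the record, the standby/online cycle you describe can be done with crmsh. A minimal sketch — the node name is a placeholder for whichever node was running the outdated slave at the time:

```shell
# Cycle the broken slave node through standby and back online (crmsh).
# NODE is a placeholder -- substitute the node that held the outdated
# mysql slave.
NODE=ctdb1

crm node standby "$NODE"   # Pacemaker stops all resources on the node
sleep 30                   # allow demote/stop operations to complete
crm node online "$NODE"    # resources restart; mysql rejoins as a slave
```

This is only a manual workaround, of course — it does not explain why the RA kept the VIP on the broken slave.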
> Here is the relevant part of the configuration:
>
> primitive db-ip-master ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="XXX" cidr_netmask="24" broadcast="XXX" \
>         op start interval="0" timeout="20s" on-fail="restart" \
>         op monitor interval="20s" timeout="20s" \
>         op stop interval="0" timeout="20s" on-fail="block"
>
> primitive db-ip-slave ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="XXX" cidr_netmask="24" broadcast="XXX" \
>         op start interval="0" timeout="20s" on-fail="restart" \
>         op monitor interval="20s" timeout="20s" \
>         op stop interval="0" timeout="20s" on-fail="block" \
>         meta target-role="Started"
>
> primitive db-mysql ocf:heartbeat:mysql \
>         params binary="/usr/bin/mysqld_safe" config="/etc/mysql/my.cnf" datadir="/var/lib/mysql" user="mysql" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" test_passwd="XXX" test_table="XXX" test_user="XXX" replication_user="XXX" replication_passwd="XXX" additional_parameters="--skip-slave-start" \
>         op start interval="0" timeout="240s" on-fail="restart" \
>         op stop interval="0" timeout="120s" on-fail="block" \
>         op monitor interval="7" timeout="30s" on-fail="restart" OCF_CHECK_LEVEL="1" \
>         op promote interval="0" timeout="120" on-fail="restart" \
>         op demote interval="0" timeout="120" on-fail="block"
>
> ms mysql db-mysql \
>         meta notify="true" master-max="1" clone-max="2" target-role="Started" is-managed="true"
>
> location db-ip-m-1 db-ip-master 0: ctdb1
> location db-ip-m-2 db-ip-master 0: ctdb2
> location db-ip-s-1 db-ip-slave 0: ctdb1
> location db-ip-s-2 db-ip-slave 0: ctdb2
> location db-ip-s-readable db-ip-slave \
>         rule $id="rule-no-reader-slave" -inf: readable lt 1
How long did you wait? Conditions are reevaluated every
cluster-recheck-interval, which is 15 minutes by default.

> location db-mysql-loc-1 mysql 100: ctdb1
> location db-mysql-loc-2 mysql 100: ctdb2
>
> colocation db-ip-slave-master -50: db-ip-slave db-ip-master
> colocation db-ip-with-master inf: db-ip-master mysql:Master
> colocation db-slave-on-db inf: db-ip-slave mysql
> order master-after-db inf: mysql db-ip-master
> order slave-after-db inf: mysql db-ip-slave
>
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-42f2063" \
>         cluster-infrastructure="corosync" \
>         symmetric-cluster="false" \
>         cluster-recheck-interval="2m" \
>         no-quorum-policy="stop" \
>         stop-orphan-resources="false" \
>         start-failure-is-fatal="false" \
>         maintenance-mode="false"
> property $id="mysql_replication" \
>         db-mysql_REPL_INFO="ctdb2|mysql-bin.002928|107"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="0"
>
> Do you have any hints on what could have gone wrong, and how we could
> avoid such issues in the future?
>
> Versions:
> Ubuntu Trusty Tahr
> Pacemaker 1.1.10
> Corosync 2.3.3
> Resource agents 3.9.3
>
> Thanks a lot in advance,
>
> Attila
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
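One more note on the recheck interval: the value actually in effect can be read back from the CIB with crm_attribute (a sketch; your cib-bootstrap-options section above sets it to 2m, so the 15-minute default should not have applied in your case):

```shell
# Query the configured cluster-recheck-interval. Empty output means the
# property is unset and Pacemaker's 15-minute default applies.
crm_attribute --type crm_config --name cluster-recheck-interval --query

# To change it, e.g. to 2 minutes:
crm_attribute --type crm_config --name cluster-recheck-interval --update 2m
```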