On Wed, 2018-10-10 at 12:18 +0200, Simon Bomm wrote:
> On Sat, Oct 6, 2018 at 06:13, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > On 05.10.2018 15:00, Simon Bomm wrote:
> > > Hi all,
> > >
> > > Using pacemaker 1.1.18-11 and the mysql resource agent
> > > (https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql),
> > > I run into unwanted behaviour. That is my point of view, of course;
> > > maybe this behaviour is expected, which is why I ask.
> > >
> > > # My test case is the following:
> > >
> > > Everything is OK on my cluster; crm_mon output is as below (no failed
> > > actions):
> > >
> > > Master/Slave Set: ms_mysql-master [ms_mysql]
> > >     Masters: [ db-master ]
> > >     Slaves: [ db-slave ]
> > >
> > > 1. I insert into a table on the master; no issue, data is replicated.
> > > 2. I shut down the net int on the master (VM),
>
> First, thanks for taking the time to answer me.
>
> > What exactly does it mean? How do you shut down net?
>
> I disconnect the network card from the VMware vSphere Console.
>
> > > pacemaker correctly starts on the other node. The master is seen as
> > > offline, and db-slave is now master:
> > >
> > > Master/Slave Set: ms_mysql-master [ms_mysql]
> > >     Masters: [ db-slave ]
> > >
> > > 3. I bring my net int back up; pacemaker sees the node online and sets
> > > the old master as the new slave:
> > >
> > > Master/Slave Set: ms_mysql-master [ms_mysql]
> > >     Masters: [ db-slave ]
> > >     Slaves: [ db-master ]
> > >
> > > 4. From this point, my external monitoring bash script shows that the
> > > SQL and IO threads are not running, but I can't see any error in the
> > > pcs status/crm_mon outputs.
> >
> > Pacemaker just shows what resource agents claim. If the resource agent
> > claims the resource is started, there is nothing pacemaker can do. You
> > need to debug what the resource agent does.
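[Editor's note: as background to Andrei's point, pacemaker only sees the agent's monitor exit status, so a slave whose replication threads are dead still looks "Started" as long as mysqld answers. A minimal sketch of the OCF monitor exit-code convention; the state labels here are illustrative, not the agent's actual variables.]

```shell
#!/bin/sh
# OCF monitor exit codes (per the OCF resource agent API):
#   0 = OCF_SUCCESS (running, e.g. as slave), 7 = OCF_NOT_RUNNING,
#   8 = OCF_RUNNING_MASTER, 1 = OCF_ERR_GENERIC.
# The heartbeat/mysql agent's monitor reports success when the mysqld
# process responds, even if the replication threads are stopped.
monitor_sketch() {
    case "$1" in
        master)  return 8 ;;   # promoted instance
        slave)   return 0 ;;   # demoted instance, process alive
        stopped) return 7 ;;   # cleanly stopped
        *)       return 1 ;;   # anything else is a hard error
    esac
}

monitor_sketch slave  && echo "slave -> OCF_SUCCESS ($?)"
monitor_sketch master || echo "master -> OCF_RUNNING_MASTER ($?)"
```

A slave with broken replication still falls in the first case, which is why crm_mon shows no failed action in the scenario above.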
>
> I've debugged it quite a lot, and that's what drove me to isolate the
> error below:
>
> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> Fix in config file or with CHANGE MASTER TO
>
> > > Consequence is that I continue inserting on my newly promoted master,
> > > but the data is never consumed by my former master.
> > >
> > > # Questions:
> > >
> > > - Is this some kind of safety behaviour to avoid data corruption when
> > >   a node comes back online?
> > > - When I manually start it the way ocf does, it returns this error:
> > >
> > > mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> > > ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> > > Fix in config file or with CHANGE MASTER TO
> > >
> > > - I would expect the cluster to stop the slave and show a failed
> > >   action; am I wrong here?
> >
> > I am not familiar with the specific application and its structure. From
> > quick browsing, the monitor action mostly checks for a running process.
> > Is the mysql process running?
>
> Yes it is; as you mentioned previously, the config wants pacemaker to
> start the mysql resource, so no problem there.
>
> > > # Other details (not sure it matters a lot)
> > >
> > > No stonith enabled, no fencing or auto-failback.
> >
> > How are you going to resolve split-brain without stonith? "Stopping
> > net" sounds exactly like split brain, in which case further
> > investigation is rather pointless.
>
> You have a point. As I'm not very familiar with stonithd, I first
> disabled it to avoid unwanted behaviour, but I'll definitely follow your
> advice and dig around.
>
> > Anyway, to give some non-hypothetical answer, full configuration and
> > logs from both systems are needed.
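[Editor's note: for context on that ERROR 1200, MySQL/MariaDB refuses START SLAVE until replication coordinates have been set with CHANGE MASTER TO; the agent normally does the equivalent internally when repointing a slave. A sketch of the statement that would be needed, built from the "host|binlog-file|position" format of the ms_mysql_REPL_INFO cluster property shown later in the thread; the function name is mine, not the agent's.]

```shell
#!/bin/sh
# Build the CHANGE MASTER TO statement a slave needs before START SLAVE
# can succeed. $1 is a "master-host|binlog-file|position" string in the
# same format as the ms_mysql_REPL_INFO cluster property; $2/$3 are the
# replication user and password.
repl_info_to_sql() {
    IFS='|' read -r host binlog pos <<EOF
$1
EOF
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_USER='%s', MASTER_PASSWORD='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$host" "$2" "$3" "$binlog" "$pos"
}

# Example with the value from the thread's cluster properties:
repl_info_to_sql 'app-db-master|mysql-bin.000012|327' user-repl mysqlreplpw
```

This only illustrates why START SLAVE fails in isolation: without such a statement having been run first, the server has no MASTER_HOST and returns error 1200.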
>
> Sure, please find the full configuration.
>
> Cluster Name: app_cluster
> Corosync Nodes:
>  app-central-master app-central-slave app-db-master app-db-slave app-quorum
> Pacemaker Nodes:
>  app-central-master app-central-slave app-db-master app-db-slave app-quorum
>
> Resources:
>  Master: ms_mysql-master
>   Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false clone-node-max=1 notify=true
>   Resource: ms_mysql (class=ocf provider=heartbeat type=mysql-app)
>    Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15 pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw replication_user=app-repl socket=/var/lib/mysql/mysql.sock test_passwd=mysqlrootpw test_user=root
>    Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
>                monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
>                monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
>                monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
>                notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
>                promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
>                start interval=0s timeout=120 (ms_mysql-start-interval-0s)
>                stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
>  Resource: vip_mysql (class=ocf provider=heartbeat type=IPaddr2-app)
>   Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true ip=10.30.3.229 nic=ens160
>   Operations: monitor interval=10s timeout=20s (vip_mysql-monitor-interval-10s)
>               start interval=0s timeout=20s (vip_mysql-start-interval-0s)
>               stop interval=0s timeout=20s (vip_mysql-stop-interval-0s)
>  Group: app
>   Resource: misc_app (class=ocf provider=heartbeat type=misc-app)
>    Attributes: crondir=/etc/app-failover/resources/cron/,/etc/cron.d/
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (misc_app-monitor-interval-5s)
>                start interval=0s timeout=20s (misc_app-start-interval-0s)
>                stop interval=0s timeout=20s (misc_app-stop-interval-0s)
>   Resource: cbd_central_broker (class=ocf provider=heartbeat type=cbd-central-broker)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (cbd_central_broker-monitor-interval-5s)
>                start interval=0s timeout=90s (cbd_central_broker-start-interval-0s)
>                stop interval=0s timeout=90s (cbd_central_broker-stop-interval-0s)
>   Resource: centcore (class=ocf provider=heartbeat type=centcore)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (centcore-monitor-interval-5s)
>                start interval=0s timeout=90s (centcore-start-interval-0s)
>                stop interval=0s timeout=90s (centcore-stop-interval-0s)
>   Resource: apptrapd (class=ocf provider=heartbeat type=apptrapd)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (apptrapd-monitor-interval-5s)
>                start interval=0s timeout=90s (apptrapd-start-interval-0s)
>                stop interval=0s timeout=90s (apptrapd-stop-interval-0s)
>   Resource: app_central_sync (class=ocf provider=heartbeat type=app-central-sync)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (app_central_sync-monitor-interval-5s)
>                start interval=0s timeout=90s (app_central_sync-start-interval-0s)
>                stop interval=0s timeout=90s (app_central_sync-stop-interval-0s)
>   Resource: snmptrapd (class=ocf provider=heartbeat type=snmptrapd)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (snmptrapd-monitor-interval-5s)
>                start interval=0s timeout=90s (snmptrapd-start-interval-0s)
>                stop interval=0s timeout=90s (snmptrapd-stop-interval-0s)
>   Resource: http (class=ocf provider=heartbeat type=apacheapp)
>    Meta Attrs: target-role=started
>    Operations: monitor interval=5s timeout=20s (http-monitor-interval-5s)
>                start interval=0s timeout=40s (http-start-interval-0s)
>                stop interval=0s timeout=60s (http-stop-interval-0s)
>   Resource: vip_app (class=ocf provider=heartbeat type=IPaddr2-app)
>    Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true ip=10.30.3.230 nic=ens160
>    Meta Attrs: target-role=started
>    Operations: monitor interval=10s timeout=20s (vip_app-monitor-interval-10s)
>                start interval=0s timeout=20s (vip_app-start-interval-0s)
>                stop interval=0s timeout=20s (vip_app-stop-interval-0s)
>   Resource: centengine (class=ocf provider=heartbeat type=centengine)
>    Meta Attrs: multiple-active=stop_start target-role=started
>    Operations: monitor interval=5s timeout=20s (centengine-monitor-interval-5s)
>                start interval=0s timeout=90s (centengine-start-interval-0s)
>                stop interval=0s timeout=90s (centengine-stop-interval-0s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
>   Resource: app
>     Disabled on: app-db-master (score:-INFINITY) (id:location-app-app-db-master--INFINITY)
>     Disabled on: app-db-slave (score:-INFINITY) (id:location-app-app-db-slave--INFINITY)
>   Resource: ms_mysql
>     Disabled on: app-central-master (score:-INFINITY) (id:location-ms_mysql-app-central-master--INFINITY)
>     Disabled on: app-central-slave (score:-INFINITY) (id:location-ms_mysql-app-central-slave--INFINITY)
>   Resource: vip_mysql
>     Disabled on: app-central-master (score:-INFINITY) (id:location-vip_mysql-app-central-master--INFINITY)
>     Disabled on: app-central-slave (score:-INFINITY) (id:location-vip_mysql-app-central-slave--INFINITY)
> Ordering Constraints:
> Colocation Constraints:
>   vip_mysql with ms_mysql-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
>   ms_mysql-master with vip_mysql (score:INFINITY) (rsc-role:Master) (with-rsc-role:Started)
> Ticket Constraints:
>
> Alerts:
>  No alerts defined
>
> Resources Defaults:
>  resource-stickiness: INFINITY
> Operations Defaults:
>  No defaults set
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: app_cluster
>  dc-version: 1.1.18-11.el7_5.3-2b07d5c5a9
>  have-watchdog: false
>  last-lrm-refresh: 1538740285
>  ms_mysql_REPL_INFO: app-db-master|mysql-bin.000012|327
>  stonith-enabled: false
>  symmetric-cluster: true
> Node Attributes:
>  app-quorum: standby=on
>
> Quorum:
>   Options:
>   Device:
>    votes: 1
>    Model: net
>     algorithm: ffsplit
>     host: app-quorum
>
> Logs are below.
>
> SLAVE when I disconnect the interface (node is isolated), with the
> associated crm_mon; this looks good to me and matches the expected
> behaviour:
>
> Oct 10 09:20:07 app-db-slave corosync[1055]: [TOTEM ] A processor failed, forming new configuration.
> Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] A new membership (10.30.3.245:196) was formed. Members left: 3
> Oct 10 09:20:11 app-db-slave corosync[1055]: [TOTEM ] Failed to receive the leave message. failed: 3
> Oct 10 09:20:11 app-db-slave corosync[1055]: [QUORUM] Members[4]: 1 2 4 5
> Oct 10 09:20:11 app-db-slave corosync[1055]: [MAIN  ] Completed service synchronization, ready to provide service.
> Oct 10 09:20:11 app-db-slave cib[1168]: notice: Node app-db-master state is now lost
> Oct 10 09:20:11 app-db-slave attrd[1172]: notice: Node app-db-master state is now lost
> Oct 10 09:20:11 app-db-slave attrd[1172]: notice: Removing all app-db-master attributes for peer loss
> Oct 10 09:20:11 app-db-slave stonith-ng[1170]: notice: Node app-db-master state is now lost
> Oct 10 09:20:11 app-db-slave pacemakerd[1084]: notice: Node app-db-master state is now lost
> Oct 10 09:20:11 app-db-slave crmd[1175]: notice: Node app-db-master state is now lost
> Oct 10 09:20:11 app-db-slave cib[1168]: notice: Purged 1 peer with id=3 and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave stonith-ng[1170]: notice: Purged 1 peer with id=3 and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave attrd[1172]: notice: Purged 1 peer with id=3 and/or uname=app-db-master from the membership cache
> Oct 10 09:20:11 app-db-slave crmd[1175]: notice: Result of notify operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21165]: INFO: app-db-slave promote is starting
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Adding inet address 10.30.3.229/16 with broadcast address 10.30.255.255 to device ens160
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: Bringing device ens160 up
> Oct 10 09:20:12 app-db-slave IPaddr2-app(vip_mysql)[21134]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -c 5 -I ens160 -s 10.30.3.229 10.30.255.255
> Oct 10 09:20:12 app-db-slave crmd[1175]: notice: Result of start operation for vip_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave lrmd[1171]: notice: ms_mysql_promote_0:21165:stderr [ Error performing operation: No such device or address ]
> Oct 10 09:20:12 app-db-slave crmd[1175]: notice: Result of promote operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:20:12 app-db-slave mysql-app(ms_mysql)[21285]: INFO: app-db-slave This will be the new master, ignoring post-promote notification.
> Oct 10 09:20:12 app-db-slave crmd[1175]: notice: Result of notify operation for ms_mysql on app-db-slave: 0 (ok)
>
> Node app-quorum: standby
> Online: [ app-central-master app-central-slave app-db-slave ]
> OFFLINE: [ app-db-master ]
>
> Active resources:
>
> Master/Slave Set: ms_mysql-master [ms_mysql]
>     Masters: [ app-db-slave ]
> vip_mysql      (ocf::heartbeat:IPaddr2-app):   Started app-db-slave
>
> And logs from the master during its isolation:
>
> Oct 10 09:23:10 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:11 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:13 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:14 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:16 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:17 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:19 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:20 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:22 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:23 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:25 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:26 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:28 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:29 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:31 app-db-master kernel: vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1436] device (ens160): carrier: link connected
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1444] device (ens160): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1456] policy: auto-activating connection 'ens160'
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1470] device (ens160): Activation: starting connection 'ens160' (9fe36e64-13ca-40cb-a174-5b4e16b826f4)
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1473] device (ens160): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1474] manager: NetworkManager state is now CONNECTING
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1479] device (ens160): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.1485] device (ens160): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2214] device (ens160): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2235] device (ens160): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2238] device (ens160): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2240] manager: NetworkManager state is now CONNECTED_LOCAL
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2554] manager: NetworkManager state is now CONNECTED_SITE
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2555] policy: set 'ens160' (ens160) as default for IPv4 routing and DNS
> Oct 10 09:23:31 app-db-master systemd: Starting Network Manager Script Dispatcher Service...
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2556] device (ens160): Activation: successful, device activated.
> Oct 10 09:23:31 app-db-master NetworkManager[692]: <info> [1539156211.2564] manager: NetworkManager state is now CONNECTED_GLOBAL
> Oct 10 09:23:31 app-db-master dbus[686]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
> Oct 10 09:23:31 app-db-master dbus[686]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
> Oct 10 09:23:31 app-db-master systemd: Started Network Manager Script Dispatcher Service.
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: new request (3 scripts)
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:1 'up' [ens160]: start running ordered scripts...
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change': new request (3 scripts)
> Oct 10 09:23:31 app-db-master nm-dispatcher: req:2 'connectivity-change': start running ordered scripts...
> Oct 10 09:23:31 app-db-master corosync[1029]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] The network interface [10.30.3.247] is now up.
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU member {10.30.3.245}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU member {10.30.3.246}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU member {10.30.3.247}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU member {10.30.3.248}
> Oct 10 09:23:31 app-db-master corosync[1029]: [TOTEM ] adding new UDPU member {10.30.3.249}
>
> As you can see, the node is back online and can communicate with the
> other nodes again, so pacemaker starts mysql as expected and brings it
> up as slave:
>
> Node app-quorum: standby
> Online: [ app-central-master app-central-slave app-db-master app-db-slave ]
>
> Active resources:
>
> Master/Slave Set: ms_mysql-master [ms_mysql]
>     Masters: [ app-db-slave ]
>     Slaves: [ app-db-master ]
>
> Resource-agents oriented logs are below.
>
> Master:
>
> Oct 10 09:24:01 app-db-master crmd[5177]: notice: Result of demote operation for ms_mysql on app-db-master: 0 (ok)
> Oct 10 09:24:02 app-db-master mysql-app(ms_mysql)[5592]: INFO: app-db-master Ignoring post-demote notification for my own demotion.
> Oct 10 09:24:02 app-db-master crmd[5177]: notice: Result of notify operation for ms_mysql on app-db-master: 0 (ok)
>
> Slave:
>
> Oct 10 09:24:01 app-db-slave crmd[1175]: notice: Result of notify operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:24:02 app-db-slave mysql-app(ms_mysql)[22969]: INFO: app-db-slave Ignoring pre-demote notification execpt for my own demotion.
> Oct 10 09:24:02 app-db-slave crmd[1175]: notice: Result of notify operation for ms_mysql on app-db-slave: 0 (ok)
> Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: INFO: app-db-slave post-demote notification for app-db-master.
> Oct 10 09:24:03 app-db-slave mysql-app(ms_mysql)[22999]: WARNING: Attempted to unset the replication master on an instance that is not configured as a replication slave
> Oct 10 09:24:03 app-db-slave crmd[1175]: notice: Result of notify operation for ms_mysql on app-db-slave: 0 (ok)
>
> So I expect to have a running replication at this point, but when I
> perform SHOW SLAVE STATUS on my new slave, I get an empty response:
>
> MariaDB [(none)]> SHOW SLAVE STATUS \G
> Empty set (0.00 sec)
>
> [root@app-db-master ~]# bash /etc/app-failover/mysql-exploit/mysql-check-status.sh
> Connection Status 'app-db-master' [OK]
> Connection Status 'app-db-slave' [OK]
> Slave Thread Status [KO]
> Error reports:
>     No slave (maybe because we cannot check a server).
> Position Status [SKIP]
> Error reports:
>     Skip because we can't identify a unique slave.
>
> From what I understand, the is_slave function from
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
> works as expected: since it gets an empty set when performing the
> monitor action, it does not consider the instance a replication slave.
> So I guess this stems from the problem already shown above, the query
> that failed with "ERROR 1200 (HY000) at line 1: Misconfigured slave:
> MASTER_HOST was not set;".
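[Editor's note: that reading matches the agent's logic: an empty SHOW SLAVE STATUS result means the server has no replication configuration at all, so the monitor treats it as a plain running instance rather than a broken slave, and no failure is ever reported. A self-contained sketch of that distinction, parsing sample output instead of querying a live server; the function and labels are mine, not the agent's.]

```shell
#!/bin/sh
# Classify a node from its "SHOW SLAVE STATUS\G" output:
#   empty output  -> not configured as a slave (is_slave is false, so the
#                    monitor reports plain success and no failed action)
#   Slave_IO_Running and Slave_SQL_Running both "Yes" -> healthy slave
#   anything else -> broken replication
classify_slave() {
    out=$1
    if [ -z "$out" ]; then
        echo "not-a-slave"
        return 0
    fi
    io=$(printf '%s\n' "$out"  | sed -n 's/.*Slave_IO_Running: *//p')
    sql=$(printf '%s\n' "$out" | sed -n 's/.*Slave_SQL_Running: *//p')
    if [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]; then
        echo "healthy-slave"
    else
        echo "broken-replication"
    fi
}

# The old master in the thread returns "Empty set", i.e. no rows at all:
classify_slave ""                      # -> not-a-slave
classify_slave "Slave_IO_Running: Yes
Slave_SQL_Running: Yes"                # -> healthy-slave
```

The first case is exactly the reported situation: the instance is running but was never re-pointed at the new master, so nothing ever looks "failed" to pacemaker.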
That looks accurate. I think this would be worth opening an issue at
https://github.com/ClusterLabs/resource-agents/issues so it is at least
documented.

Not directly helpful, but maybe worth mentioning: there seems to be more
community activity around the pgsql and galera agents. You may have to
do more of your own code-diving with mysql. It may be worth comparing
the various agents to see how they handle various situations. (Note
there is the https://github.com/ClusterLabs/PAF agent as well as the
heartbeat pgsql agent.)

> I may be missing something obvious. Please tell me if I can bring more
> information around my issue.
>
> Rgds
>
> > > Symmetric cluster configured.
> > >
> > > Details of my pacemaker resource configuration:
> > >
> > > Master: ms_mysql-master
> > >  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false clone-node-max=1 notify=true
> > >  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
> > >   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15 pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw replication_user=user-repl socket=/var/lib/mysql/mysql.sock test_passwd=mysqlrootpw test_user=root
> > >   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
> > >               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
> > >               monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
> > >               monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
> > >               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
> > >               promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
> > >               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
> > >               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
> > >
> > > Anything I'm missing on this? Did not find a clearly similar use case
> > > when googling around network outage and pacemaker.
> > >
> > > Thanks
--
Ken Gaillot <kgail...@redhat.com>
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org