Re: [ClusterLabs] Questions about pacemaker/ mysql resource agent behaviour when network fail
On Sat, 6 Oct 2018 at 06:13, Andrei Borzenkov wrote:

> 05.10.2018 15:00, Simon Bomm wrote:
>> Hi all,
>>
>> Using pacemaker 1.1.18-11 and the mysql resource agent
>> (https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql),
>> I run into unwanted behaviour. That is my point of view, of course; maybe
>> it is expected to behave this way, which is why I am asking.
>>
>> # My test case is the following:
>>
>> Everything is OK on my cluster; crm_mon output is as below (no failed
>> actions):
>>
>> Master/Slave Set: ms_mysql-master [ms_mysql]
>>     Masters: [ db-master ]
>>     Slaves: [ db-slave ]
>>
>> 1. I insert into a table on the master; the data is replicated without
>>    issue.
>> 2. I shut down the network interface on the master (a VM).

First, thanks for taking the time to answer me.

> What exactly does that mean? How do you shut down the network?

I disconnect the network card from the VMware vSphere console.

>> Pacemaker correctly starts the resource on the other node. The master is
>> seen as offline, and db-slave is now master:
>>
>> Master/Slave Set: ms_mysql-master [ms_mysql]
>>     Masters: [ db-slave ]
>>
>> 3. I bring my network interface back up. Pacemaker sees the node online
>>    and sets the old master as the new slave:
>>
>> Master/Slave Set: ms_mysql-master [ms_mysql]
>>     Masters: [ db-slave ]
>>     Slaves: [ db-master ]
>>
>> 4. From this point, my external monitoring bash script shows that the
>>    SQL and IO threads are not running, but I can't see any error in the
>>    pcs status/crm_mon output.
>
> Pacemaker just shows what resource agents claim. If the resource agent
> claims the resource is started, there is nothing pacemaker can do. You
> need to debug what the resource agent does.
I've debugged it quite a lot, and that's what drove me to isolate the error
below:

mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
Fix in config file or with CHANGE MASTER TO

>> The consequence is that I keep inserting on my newly promoted master,
>> but the data is never consumed by my former master.
>>
>> # Questions:
>>
>> - Is this some kind of safety behaviour to avoid data corruption when a
>>   node comes back online?
>> - When I manually start the slave the way the OCF agent does, it returns
>>   this error:
>>
>>   mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
>>   ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not
>>   set; Fix in config file or with CHANGE MASTER TO
>>
>> - I would expect the cluster to stop the slave and show a failed action.
>>   Am I wrong here?
>
> I am not familiar with the specific application and its structure. From
> quick browsing, the monitor action mostly checks for a running process.
> Is the mysql process running?

Yes, it is. As you mentioned previously, the configuration wants pacemaker
to start the mysql resource, so no problem there.

>> # Other details (not sure they matter a lot)
>>
>> No stonith enabled, no fencing or auto-failback.
>
> How are you going to resolve split-brain without stonith? "Stopping the
> net" sounds exactly like split-brain, in which case further investigation
> is rather pointless.

You make a good point. As I'm not very familiar with stonithd, I first
disabled it to avoid unwanted behaviour, but I'll definitely follow your
advice and dig around.

> Anyway, to give a non-hypothetical answer, the full configuration and
> logs from both systems are needed.
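As a side note on the error isolated above: ERROR 1200 means the returning node has no MASTER_HOST configured, so START SLAVE has nothing to connect to, and the manual fix the error message itself suggests is a CHANGE MASTER TO pointing at the newly promoted master. A minimal sketch of building that statement (the statement shape is standard MySQL syntax, but the host, user, password, and binlog coordinates below are hypothetical placeholders, not values taken from this cluster):

```python
# Sketch: build the CHANGE MASTER TO statement needed before START SLAVE
# works again after ERROR 1200. All values are placeholders; the real
# binlog file and position must be read with SHOW MASTER STATUS on the
# current master.

def change_master_sql(host: str, user: str, password: str,
                      log_file: str, log_pos: int) -> str:
    """Return a CHANGE MASTER TO statement re-pointing replication."""
    return (
        "CHANGE MASTER TO "
        f"MASTER_HOST='{host}', "
        f"MASTER_USER='{user}', "
        f"MASTER_PASSWORD='{password}', "
        f"MASTER_LOG_FILE='{log_file}', "
        f"MASTER_LOG_POS={log_pos};"
    )

if __name__ == "__main__":
    # Hypothetical coordinates read from the newly promoted master.
    print(change_master_sql("db-slave", "user-repl", "mysqlreplpw",
                            "mysql-bin.000042", 154))
```

The resulting statement would then be fed to the mysql CLI on the old master before retrying START SLAVE.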
Sure, please find the full configuration:

Cluster Name: app_cluster
Corosync Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum
Pacemaker Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum

Resources:
 Master: ms_mysql-master
  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false clone-node-max=1 notify=true
  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql-app)
   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
               datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
               pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
               replication_user=app-repl socket=/var/lib/mysql/mysql.sock
               test_passwd=mysqlrootpw test_user=root
   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
               promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
  Resource: vip_mysql (class=ocf provider=heartbeat type=IPaddr2-app)
   Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true
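The monitor operations in the configuration above only establish that mysqld is up, which is why the stopped replication threads go unnoticed; the external script mentioned earlier catches them by inspecting SHOW SLAVE STATUS instead. A minimal sketch of that kind of check (illustrative only, not the actual script from this thread; the sample output is made up, but Slave_IO_Running and Slave_SQL_Running are the standard SHOW SLAVE STATUS fields):

```python
# Sketch of an external replication check: parse the "Key: Value" lines
# printed by `mysql -e "SHOW SLAVE STATUS\G"` and report whether both
# replication threads are running. Sample output below is illustrative.

def parse_slave_status(output: str) -> dict:
    """Turn the 'Key: Value' lines of SHOW SLAVE STATUS\\G into a dict."""
    status = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip the "*** 1. row ***" separator lines emitted by \G output.
        if ":" in line and not line.startswith("*"):
            key, _, value = line.partition(":")
            status[key.strip()] = value.strip()
    return status

def replication_healthy(status: dict) -> bool:
    """Both the IO and SQL threads must be running for the slave to catch up."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes")

if __name__ == "__main__":
    sample = """\
*************************** 1. row ***************************
               Slave_IO_Running: No
              Slave_SQL_Running: No
                    Master_Host:
"""
    # With both threads stopped (the situation in step 4 of the test case),
    # the check fails even though the mysqld process itself is running.
    print(replication_healthy(parse_slave_status(sample)))
```

A check like this could be wired into a cron job or the agent's monitor action so a broken slave actually surfaces as a failed action instead of staying silently green.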
05.10.2018 15:00, Simon Bomm wrote:
> Hi all,
>
> Using pacemaker 1.1.18-11 and the mysql resource agent
> (https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql),
> I run into unwanted behaviour. That is my point of view, of course; maybe
> it is expected to behave this way, which is why I am asking.
>
> # My test case is the following:
>
> Everything is OK on my cluster; crm_mon output is as below (no failed
> actions):
>
> Master/Slave Set: ms_mysql-master [ms_mysql]
>     Masters: [ db-master ]
>     Slaves: [ db-slave ]
>
> 1. I insert into a table on the master; the data is replicated without
>    issue.
> 2. I shut down the network interface on the master (a VM).

What exactly does that mean? How do you shut down the network?

> Pacemaker correctly starts the resource on the other node. The master is
> seen as offline, and db-slave is now master:
>
> Master/Slave Set: ms_mysql-master [ms_mysql]
>     Masters: [ db-slave ]
>
> 3. I bring my network interface back up. Pacemaker sees the node online
>    and sets the old master as the new slave:
>
> Master/Slave Set: ms_mysql-master [ms_mysql]
>     Masters: [ db-slave ]
>     Slaves: [ db-master ]
>
> 4. From this point, my external monitoring bash script shows that the
>    SQL and IO threads are not running, but I can't see any error in the
>    pcs status/crm_mon output.

Pacemaker just shows what resource agents claim. If the resource agent
claims the resource is started, there is nothing pacemaker can do. You
need to debug what the resource agent does.

> The consequence is that I keep inserting on my newly promoted master, but
> the data is never consumed by my former master.
>
> # Questions:
>
> - Is this some kind of safety behaviour to avoid data corruption when a
>   node comes back online?
> - When I manually start the slave the way the OCF agent does, it returns
>   this error:
>
>   mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
>   ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not
>   set; Fix in config file or with CHANGE MASTER TO
>
> - I would expect the cluster to stop the slave and show a failed action.
>   Am I wrong here?

I am not familiar with the specific application and its structure. From
quick browsing, the monitor action mostly checks for a running process.
Is the mysql process running?

> # Other details (not sure they matter a lot)
>
> No stonith enabled, no fencing or auto-failback.

How are you going to resolve split-brain without stonith? "Stopping the
net" sounds exactly like split-brain, in which case further investigation
is rather pointless.

Anyway, to give a non-hypothetical answer, the full configuration and logs
from both systems are needed.

> Symmetric cluster configured.
>
> Details of my pacemaker resource configuration:
>
> Master: ms_mysql-master
>  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false clone-node-max=1 notify=true
>  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
>   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
>               datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
>               pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
>               replication_user=user-repl socket=/var/lib/mysql/mysql.sock
>               test_passwd=mysqlrootpw test_user=root
>   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
>               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
>               monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
>               monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
>               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
>               promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
>               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
>               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
>
> Anything I'm missing here? I did not find a clearly similar use case when
> googling around network outages and pacemaker.
>
> Thanks

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org