Re: [ClusterLabs] Questions about pacemaker/mysql resource agent behaviour when the network fails

2018-10-10 Thread Simon Bomm
On Sat, 6 Oct 2018 at 06:13, Andrei Borzenkov  wrote:

> On 05.10.2018 15:00, Simon Bomm wrote:
> > Hi all,
> >
> > Using pacemaker 1.1.18-11 and mysql resource agent (
> >
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql
> ),
> > I run into unwanted behaviour. That is my point of view, of course;
> > maybe it is expected to work this way, which is why I'm asking.
> >
> > # My test case is the following:
> >
> > Everything is OK on my cluster; crm_mon output is as below (no failed
> > actions):
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >  Masters: [ db-master ]
> >  Slaves: [ db-slave ]
> >
> > 1. I insert into a table on the master; the data is replicated with no issue.
> > 2. I shut down the network interface on the master (a VM),
>
>
First, thanks for taking the time to answer me.


> What exactly does that mean? How do you shut down the network?
>
>
I disconnect the network card from the VMware vSphere console.
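
The same failure could also be reproduced from inside the VM rather than
through vSphere; a minimal sketch, assuming eth0 is the interface carrying
the cluster and replication traffic:

# Simulate a network failure by dropping all traffic on the interface
# (eth0 is an assumption; substitute the real interface name)
iptables -A INPUT -i eth0 -j DROP
iptables -A OUTPUT -o eth0 -j DROP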


> > pacemaker correctly fails over to the
> > other node. The master is seen as offline, and db-slave is now master:
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >  Masters: [ db-slave ]
> >
> > 3. I bring my network interface back up; pacemaker sees the node online
> > and sets the old master up as the new slave:
> >
> >  Master/Slave Set: ms_mysql-master [ms_mysql]
> >  Masters: [ db-slave ]
> >  Slaves: [ db-master ]
> >
> > 4. From this point on, my external monitoring bash script shows that the
> > SQL and IO threads are not running, but I can't see any error in the pcs
> > status/crm_mon output.
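
For reference, a check of this kind boils down to something like the
following (a sketch using the credentials from the configuration below;
Slave_IO_Running, Slave_SQL_Running and the Last_*_Error columns are
standard SHOW SLAVE STATUS fields):

# Both threads should report "Yes" on a healthy slave
mysql -h localhost -u root -pmysqlrootpw -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Last_(IO|SQL)_Error'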
>
> Pacemaker just shows what resource agents claim. If the resource agent
> claims the resource is started, there is nothing pacemaker can do. You need
> to debug what the resource agent does.
>
>
I've debugged it quite a lot, and that's what drove me to isolate the error
below:

> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> Fix in config file or with CHANGE MASTER TO
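
For reference, the repair the error message asks for would be roughly the
following, run on the broken slave (a sketch only: the master host, user
and password follow the configuration below, while the binlog coordinates
are hypothetical and would come from SHOW MASTER STATUS on the current
master; a GTID setup would use MASTER_AUTO_POSITION=1 instead):

mysql -u root -pmysqlrootpw <<'SQL'
CHANGE MASTER TO
  MASTER_HOST='db-slave',              -- current master
  MASTER_USER='user-repl',
  MASTER_PASSWORD='mysqlreplpw',
  MASTER_LOG_FILE='mysql-bin.000042',  -- hypothetical coordinates
  MASTER_LOG_POS=4;
START SLAVE;
SQL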



> > The consequence is that I keep inserting on my newly promoted master,
> > but the data is never consumed by my former master.
> >
> > # Questions:
> >
> > - Is this some kind of safety behaviour to avoid data corruption when a
> > node comes back online?
> > - When I start the slave manually, the way the OCF agent does, it returns
> > this error:
> >
> > mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> > ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> > Fix in config file or with CHANGE MASTER TO
> >
> > - I would expect the cluster to stop the slave and show a failed action;
> > am I wrong here?
> >
>
> I am not familiar with this specific application and its structure. From a
> quick browse, the monitor action mostly checks for a running process. Is
> the MySQL process running?
>

Yes it is; as you mentioned previously, the config wants pacemaker to start
the mysql resource, so no problem there.

>
> > # Other details (not sure it matters much)
> >
> > No stonith enabled, no fencing or auto-failback.
>
> How are you going to resolve split-brain without stonith? "Stopping net"
> sounds exactly like split brain, in which case further investigation is
> rather pointless.
>
>
You have a point. As I'm not very familiar with stonithd, I disabled it at
first to avoid unwanted behaviour, but I'll definitely follow your advice
and dig around.
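
From what I can tell, in a vSphere environment the usual route is the
fence_vmware_soap agent; a rough sketch, with the vCenter address,
credentials and host-to-VM mapping all hypothetical:

pcs stonith create vmfence fence_vmware_soap \
  ipaddr=vcenter.example.com login=fence-user passwd=fence-pass \
  ssl=1 ssl_insecure=1 \
  pcmk_host_map="app-db-master:db-master-vm;app-db-slave:db-slave-vm"
pcs property set stonith-enabled=true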


> Anyway, to give a non-hypothetical answer, the full configuration and logs
> from both systems are needed.
>
>
Sure, please find the full configuration below.

Cluster Name: app_cluster
Corosync Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum
Pacemaker Nodes:
 app-central-master app-central-slave app-db-master app-db-slave app-quorum

Resources:
 Master: ms_mysql-master
  Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
clone-node-max=1 notify=true
  Resource: ms_mysql (class=ocf provider=heartbeat type=mysql-app)
   Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
replication_user=app-repl socket=/var/lib/mysql/mysql.sock
test_passwd=mysqlrootpw test_user=root
   Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
               monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
               monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
               monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
               notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
               promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
               start interval=0s timeout=120 (ms_mysql-start-interval-0s)
               stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
 Resource: vip_mysql (class=ocf provider=heartbeat type=IPaddr2-app)
  Attributes: broadcast=10.30.255.255 cidr_netmask=16 flush_routes=true

Re: [ClusterLabs] Questions about pacemaker/mysql resource agent behaviour when the network fails

2018-10-05 Thread Andrei Borzenkov
On 05.10.2018 15:00, Simon Bomm wrote:
> Hi all,
> 
> Using pacemaker 1.1.18-11 and the mysql resource agent (
> https://github.com/ClusterLabs/resource-agents/blob/RHEL6/heartbeat/mysql),
> I run into unwanted behaviour. That is my point of view, of course; maybe
> it is expected to work this way, which is why I'm asking.
> 
> # My test case is the following:
> 
> Everything is OK on my cluster; crm_mon output is as below (no failed
> actions):
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>  Masters: [ db-master ]
>  Slaves: [ db-slave ]
> 
> 1. I insert into a table on the master; the data is replicated with no issue.
> 2. I shut down the network interface on the master (a VM),

What exactly does that mean? How do you shut down the network?

> pacemaker correctly fails over to the
> other node. The master is seen as offline, and db-slave is now master:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>  Masters: [ db-slave ]
> 
> 3. I bring my network interface back up; pacemaker sees the node online
> and sets the old master up as the new slave:
> 
>  Master/Slave Set: ms_mysql-master [ms_mysql]
>  Masters: [ db-slave ]
>  Slaves: [ db-master ]
> 
> 4. From this point on, my external monitoring bash script shows that the
> SQL and IO threads are not running, but I can't see any error in the pcs
> status/crm_mon output.

Pacemaker just shows what resource agents claim. If the resource agent
claims the resource is started, there is nothing pacemaker can do. You need
to debug what the resource agent does.
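
One way to do that is to invoke the agent's monitor action by hand with the
same parameters pacemaker passes; a sketch, assuming the parameter values
shown in your configuration:

# Trace the heartbeat mysql agent's monitor action
# OCF return codes: 0 = running, 7 = not running, 8 = running as master
OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_binary=/usr/bin/mysqld_safe \
OCF_RESKEY_config=/etc/my.cnf.d/server.cnf \
OCF_RESKEY_test_user=root \
OCF_RESKEY_test_passwd=mysqlrootpw \
bash -x /usr/lib/ocf/resource.d/heartbeat/mysql monitor
echo "monitor rc=$?"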

> The consequence is that I keep inserting on my newly promoted master, but
> the data is never consumed by my former master.
> 
> # Questions:
> 
> - Is this some kind of safety behaviour to avoid data corruption when a
> node comes back online?
> - When I start the slave manually, the way the OCF agent does, it returns
> this error:
> 
> mysql -h localhost -u user-repl -pmysqlreplpw -e "START SLAVE"
> ERROR 1200 (HY000) at line 1: Misconfigured slave: MASTER_HOST was not set;
> Fix in config file or with CHANGE MASTER TO
> 
> - I would expect the cluster to stop the slave and show a failed action;
> am I wrong here?
> 

I am not familiar with this specific application and its structure. From a
quick browse, the monitor action mostly checks for a running process. Is
the MySQL process running?
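
You can also ask pacemaker to run the monitor on demand and report the
agent's verdict directly, for example:

# Executes the resource agent's monitor action once and prints the result
crm_resource --resource ms_mysql --force-check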

> # Other details (not sure it matters much)
> 
> No stonith enabled, no fencing or auto-failback.

How are you going to resolve split-brain without stonith? "Stopping net"
sounds exactly like split brain, in which case further investigation is
rather pointless.

Anyway, to give a non-hypothetical answer, the full configuration and logs
from both systems are needed.
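
Both are usually collected with something like the following (the time
window is hypothetical):

pcs config                # dump the full cluster configuration
crm_report --from "2018-10-05 12:00" --to "2018-10-05 15:00" /tmp/app-report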

> A symmetric cluster is configured.
> 
> Details of my pacemaker resource configuration:
> 
>  Master: ms_mysql-master
>   Meta Attrs: master-node-max=1 clone_max=2 globally-unique=false
> clone-node-max=1 notify=true
>   Resource: ms_mysql (class=ocf provider=heartbeat type=mysql)
>Attributes: binary=/usr/bin/mysqld_safe config=/etc/my.cnf.d/server.cnf
> datadir=/var/lib/mysql evict_outdated_slaves=false max_slave_lag=15
> pid=/var/lib/mysql/mysql.pid replication_passwd=mysqlreplpw
> replication_user=user-repl socket=/var/lib/mysql/mysql.sock
> test_passwd=mysqlrootpw test_user=root
> Operations: demote interval=0s timeout=120 (ms_mysql-demote-interval-0s)
>             monitor interval=20 timeout=30 (ms_mysql-monitor-interval-20)
>             monitor interval=10 role=Master timeout=30 (ms_mysql-monitor-interval-10)
>             monitor interval=30 role=Slave timeout=30 (ms_mysql-monitor-interval-30)
>             notify interval=0s timeout=90 (ms_mysql-notify-interval-0s)
>             promote interval=0s timeout=120 (ms_mysql-promote-interval-0s)
>             start interval=0s timeout=120 (ms_mysql-start-interval-0s)
>             stop interval=0s timeout=120 (ms_mysql-stop-interval-0s)
> 
> Is there anything I'm missing here? I did not find a clearly similar use
> case when googling around network outages and pacemaker.
> 
> Thanks
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org