Re: [Pacemaker] MySQL, Percona replication manager - split brain
On Sat, 25 Oct 2014 23:34:54 +0300, Andrew ni...@seti.kr.ua wrote:
On 25.10.2014 22:34, Digimer wrote:
On 25/10/14 03:32 PM, Andrew wrote:

Andrew: Hi all. I use Percona as an RA on a cluster (nothing mission-critical, currently - just zabbix data); today after restarting the MySQL resource (crm resource restart p_mysql) I got a split-brain state - MySQL for some reason started first on the ex-slave node, and the ex-master started later (possibly I set too small a shutdown timeout - only 120s, but I'm not sure). After restarting the resource on both nodes the MySQL replication seemed OK - but then after ~50min it fell into split brain again for an unknown reason (no resource restart was noticed). In 'show replication status' there is an error in a table caused by a unique index duplicate. So I have some questions:

1) Which thing causes split brain, and how to avoid it in future?

Digimer: Cause: Logs?

Andrew:
    Oct 25 13:54:13 node2 crmd[29248]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
    Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_config: On loss of CCM Quorum: Ignore
    Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation monitor found resource p_pgsql:0 active in master mode on node1.cluster
    Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation monitor found resource p_mysql:1 active in master mode on node2.cluster

Andrei: That seems too late. The real cause is that the resource was reported as being in master state on both nodes, and this happened earlier.

Digimer: Prevent: Fencing (aka stonith). This is why fencing is required.

Andrew: No node failure. Just the daemon was restarted.

Andrei: Split brain == loss of communication. It does not matter whether communication was lost because the node failed or because the daemon was not running. There is no way for the surviving node to know *why* communication was lost.

Andrew: 2) How to resolve the split-brain state? Is it enough just to wait for the failure, then restart mysql by hand and clean the row with the duplicate index in the slave DB, and then run the resource again? Or is there some automation for such cases?

Digimer: How are you sharing data? Can you give us a better understanding of your setup?

Andrew: Semi-synchronous MySQL replication, if you mean sharing the DB log between nodes.
Re: [Pacemaker] MySQL, Percona replication manager - split brain
On 26.10.2014 08:32, Andrei Borzenkov wrote:

Andrei: [...] That seems too late. The real cause is that the resource was reported as being in master state on both nodes, and this happened earlier.

Andrew: These are different resources (pgsql and mysql).

Andrei: [...] Split brain == loss of communication. It does not matter whether communication was lost because the node failed or because the daemon was not running. There is no way for the surviving node to know *why* communication was lost.

Andrew: So how will stonith help in this case? The daemon will be restarted after its death if that occurs during a restart, and stonith will see a live daemon... So what is the easiest split-brain solution? Just to stop the daemons and copy all the mysql data from the good node to the bad one?
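[Editor's note: for the manual repair itself, a common approach is sketched below. It is not from this thread; it assumes the duplicate-key error is the only divergence and that the other node's data is authoritative.]

    -- on the broken slave, inspect the error first:
    SHOW SLAVE STATUS\G

    -- either skip the single duplicate-key event:
    STOP SLAVE;
    SET GLOBAL sql_slave_skip_counter = 1;
    START SLAVE;

    -- or, if the data has truly diverged, rebuild the slave
    -- from a fresh dump of the good node instead.

Skipping events silently discards writes, so re-seeding from the good node is the safer choice when in doubt.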
Re: [Pacemaker] split brain - after network recovery - resources can still be migrated
On Sat, 25 Oct 2014 19:11:02 -0400, Digimer li...@alteeve.ca wrote:
On 25/10/14 06:35 PM, Vladimir wrote:
On Sat, 25 Oct 2014 17:30:07 -0400, Digimer li...@alteeve.ca wrote:
On 25/10/14 05:09 PM, Vladimir wrote:

Vladimir: Hi, currently I'm testing a 2-node setup using ubuntu trusty.

The scenario: all communication links between the 2 nodes are cut off. This results in a split-brain situation and both nodes bring their resources online. When the communication links come back, I see the following behaviour: on the drbd level the split brain is detected and the device is disconnected on both nodes because of after-sb-2pri disconnect, and then it goes to the StandAlone ConnectionState. I'm wondering why pacemaker does not let the resources fail. It is still possible to migrate resources between the nodes although they're in the StandAlone ConnectionState. After a split brain that's not what I want. Is this the expected behaviour? Is it possible to let the resources fail after the network recovery, to avoid further data corruption? (At the moment I can't use resource or node level fencing in my setup.)

Here is the main part of my config:

    # dpkg -l | awk '$2 ~ /^(pacem|coro|drbd|libqb)/{print $2,$3}'
    corosync 2.3.3-1ubuntu1
    drbd8-utils 2:8.4.4-1ubuntu1
    libqb-dev 0.16.0.real-1ubuntu3
    libqb0 0.16.0.real-1ubuntu3
    pacemaker 1.1.10+git20130802-1ubuntu2.1
    pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1

    # pacemaker
    primitive drbd-mysql ocf:linbit:drbd \
        params drbd_resource=mysql \
        op monitor interval=29s role=Master \
        op monitor interval=30s role=Slave
    ms ms-drbd-mysql drbd-mysql \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

Digimer: Split-brains are prevented by using reliable fencing (aka stonith). You configure stonith in pacemaker (using IPMI/iRMC/iLO/etc, switched PDUs, etc). Then you configure DRBD to use the crm-fence-peer.sh fence-handler and you set the fencing policy to 'resource-and-stonith;'. This way, if all links fail, both nodes block and call a fence. The faster one fences (powers off) the slower, and then it begins recovery, assured that the peer is not doing the same. Without stonith/fencing, there is no defined behaviour. You will get split-brains and that is that. Consider: both nodes lose contact with their peer. Without fencing, both must assume the peer is dead and thus take over resources.

Vladimir: That split brains can occur in such a setup is clear. But I would expect pacemaker to stop the drbd resource when the link between the cluster nodes is reestablished, instead of continuing to run it.

Digimer: DRBD will refuse to reconnect until it is told which node's data to delete. This is data loss and cannot be safely automated.

Vladimir: Sorry if I described it unclearly, but I don't want pacemaker to do an automatic split-brain recovery. That would not make any sense to me either. This decision has to be taken by an administrator. But is it possible to configure pacemaker to do the following?

- if there are 2 nodes which can see and communicate with each other, AND
- if their disk state is not UpToDate/UpToDate (typically after a split brain),
- then let the drbd resource fail, because something is obviously broken and an administrator has to decide how to continue.

Digimer: This is why stonith is required in clusters. Even with quorum, you can't assume anything about the state of the peer until it is fenced, so it would only give you a false sense of security.

Vladimir: Maybe I can use resource-level fencing.

Digimer: You need node-level fencing.

Vladimir: I know node-level fencing is more secure. But shouldn't resource-level fencing also work here? E.g. http://www.drbd.org/users-guide/s-pacemaker-fencing.html

Currently I can't use IPMI, APC switches or a shared storage device for fencing - at most fencing via ssh. But from what I've read, that is also not recommended for production setups. Thanks in advance.
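[Editor's note: Digimer's recommendation maps to a small drbd.conf fragment, sketched below for DRBD 8.4; handler paths vary by distribution and packaging.]

    # /etc/drbd.d/global_common.conf (fragment)
    common {
        disk {
            fencing resource-and-stonith;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }

With this in place plus a working stonith device in pacemaker, a link loss blocks DRBD I/O until the peer is fenced, instead of letting both sides diverge.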
Re: [Pacemaker] MySQL, Percona replication manager - split brain
On Sun, 26 Oct 2014 10:51:13 +0200, Andrew ni...@seti.kr.ua wrote:

Andrew: [...] today after restarting the MySQL resource (crm resource restart p_mysql) I got a split-brain state - MySQL for some reason started first on the ex-slave node, and the ex-master started later [...]

Andrei: Your logs do not show a resource restart - they show a pacemaker restart on node2.

Andrew: [...] So how will stonith help in this case? The daemon will be restarted after its death if that occurs during a restart, and stonith will see a live daemon... So what is the easiest split-brain solution? Just to stop the daemons and copy all the mysql data from the good node to the bad one?

Andrei: There is no split brain visible in your log. Pacemaker on node2 was restarted, cleanly as far as I can tell, and reintegrated back into the cluster. Maybe node1 lost node2, but that needs logs from node1. You are probably misusing "split brain" in this case. Split brain means the nodes lost communication with each other, so each node is unaware of the state of the resources on the other node. Here "nodes" means corosync/pacemaker, not individual resources.
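[Editor's note: to tell a real (cluster-level) split brain from a resource-level inconsistency, check the membership each node sees; a sketch:]

    # run on each node: do both still report the peer online?
    crm_mon -1 | grep -A1 Online    # expect: Online: [ node1.cluster node2.cluster ]
    corosync-cfgtool -s             # local ring status; faults here mean lost links

If both nodes list each other as online, there was no split brain in the corosync/pacemaker sense, only a replication problem inside the resource.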
Re: [Pacemaker] split brain - after network recovery - resources can still be migrated
On Sun, 26 Oct 2014 12:01:03 +0100, Vladimir m...@foomx.de wrote:

Vladimir: [...] But is it possible to configure pacemaker to do the following?

- if there are 2 nodes which can see and communicate with each other, AND
- if their disk state is not UpToDate/UpToDate (typically after a split brain),
- then let the drbd resource fail, because something is obviously broken and an administrator has to decide how to continue.

Andrei: This would require resource agent support. But it looks like the current resource agent relies on fencing to resolve the split-brain situation. As long as the resource agent itself does not indicate resource failure, there is nothing pacemaker can do.

Vladimir: [...] Currently I can't use IPMI, APC switches or a shared storage device for fencing - at most fencing via ssh. But from what I've read, that is also not recommended for production setups.

Andrei: You could try the meatware stonith agent. This does exactly what you want - it freezes further processing unless an administrator manually declares one node as down.
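[Editor's note: for reference, a minimal meatware configuration in crm shell syntax - a sketch only; the meatware plugin and its meatclient tool ship with cluster-glue.]

    primitive st-meat stonith:meatware \
        params hostlist="node1 node2" \
        op monitor interval=3600s
    clone cl-st-meat st-meat

    # after a failure, an administrator confirms the peer is really
    # powered off, then acknowledges the fence with:
    #   meatclient -c node2

Until that acknowledgement arrives, the cluster blocks, which is the "freeze until a human decides" behaviour Vladimir asked for.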
Re: [Pacemaker] split brain - after network recovery - resources can still be migrated
On 26/10/14 12:32 PM, Andrei Borzenkov wrote:

Andrei: [...] You could try the meatware stonith agent. This does exactly what you want - it freezes further processing unless an administrator manually declares one node as down.

Digimer: Support for this was dropped in the Red Hat world back in 2010-ish, because it was so easy for a panicky admin to clear the fence without adequately ensuring the peer node had actually been turned off. I *strongly* recommend against using manual fencing. If your nodes have IPMI, use that. It's super well tested.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
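[Editor's note: as an illustration of IPMI-based fencing, a sketch in crm shell syntax; the addresses, credentials and per-node layout are hypothetical, and fence_ipmilan is the usual agent on RHEL/CentOS-style stacks.]

    primitive fence-node1 stonith:fence_ipmilan \
        params pcmk_host_list="node1" ipaddr="10.0.0.1" login="admin" passwd="secret" lanplus="true" \
        op monitor interval=60s
    primitive fence-node2 stonith:fence_ipmilan \
        params pcmk_host_list="node2" ipaddr="10.0.0.2" login="admin" passwd="secret" lanplus="true" \
        op monitor interval=60s
    # keep each fence device off the node it is meant to kill
    location l-fence-node1 fence-node1 -inf: node1
    location l-fence-node2 fence-node2 -inf: node2
    property stonith-enabled=true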
Re: [Pacemaker] MySQL, Percona replication manager - split brain
On 26.10.2014 17:44, Andrei Borzenkov wrote:

Andrei: Your logs do not show a resource restart - they show a pacemaker restart on node2.

Andrew: Yes, you're right. This was a pacemaker restart.

Andrei: [...] There is no split brain visible in your log. Pacemaker on node2 was restarted, cleanly as far as I can tell, and reintegrated back into the cluster. Maybe node1 lost node2, but that needs logs from node1.
Andrew: Here is the log from the other node:

    Oct 25 13:54:13 node1 pacemakerd[21773]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
    Oct 25 13:54:13 node1 attrd[26079]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_mysql (56)
    Oct 25 13:54:13 node1 attrd[26079]: notice: attrd_perform_update: Sent update 6993: master-p_mysql=56
    Oct 25 13:54:16 node1 attrd[26079]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_mysql (53)
    Oct 25 13:54:16 node1 attrd[26079]: notice: attrd_perform_update: Sent update 6995: master-p_mysql=53
    Oct 25 13:54:16 node1 pacemakerd[22035]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
    Oct 25 13:54:18 node1 attrd[26079]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-p_mysql (60)
    Oct 25 13:54:18 node1 attrd[26079]: notice: attrd_perform_update: Sent update 6997: master-p_mysql=60
    Oct 25 13:54:18 node1 pacemakerd[22335]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
    Oct 25 13:54:19 node1 pacemakerd[22476]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
    Oct 25 13:54:19 node1 mysql(p_mysql)[22446]: INFO: Ignoring post-demote notification execpt for my own demotion.
    Oct 25 13:54:19 node1 crmd[26081]: notice: process_lrm_event: LRM operation p_mysql_notify_0 (call=2423, rc=0, cib-update=0, confirmed=true) ok
    Oct 25 13:54:19 node1 crmd[26081]: notice: process_lrm_event: LRM operation p_pgsql_notify_0 (call=2425, rc=0, cib-update=0, confirmed=true) ok
    Oct 25 13:54:20 node1 pacemakerd[22597]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
    Oct 25 13:54:20 node1 mysql(p_mysql)[22540]: INFO: Ignoring post-demote notification execpt for my own demotion.
    Oct 25 13:54:20 node1 crmd[26081]: notice: process_lrm_event: LRM operation p_mysql_notify_0 (call=2433, rc=0, cib-update=0, confirmed=true) ok
    Oct 25 13:54:20 node1 IPaddr(ClusterIP)[22538]: INFO: Adding inet address 192.168.253.254/24 with broadcast address 192.168.253.255 to device br0
    Oct 25 13:54:20 node1 IPaddr2(pgsql_reader_vip)[22539]: INFO: Adding inet address 192.168.253.31/24
Re: [Pacemaker] MySQL, Percona replication manager - split brain
Please send log files as attachments. The line wrapping makes them very hard to read inline.
Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?
On 25 Oct 2014, at 12:38 am, Andrei Borzenkov arvidj...@gmail.com wrote:
On Fri, Oct 24, 2014 at 9:17 AM, Andrew Beekhof and...@beekhof.net wrote:
On 16 Oct 2014, at 9:31 pm, Andrei Borzenkov arvidj...@gmail.com wrote:

Andrei: The primary goal is to transparently update software in the cluster. I just did an HA suite update using a plain RPM and observed that RPM attempts to restart the stack (rcopenais try-restart). So a) if it had worked, it would mean resources had been migrated off this node - an interruption; b) it did not work - apparently the new versions of the installed utilities were incompatible with the running pacemaker, so the request to shut down crm failed and openais hung forever. The usual workflow with one cluster product I worked with before was: stop the cluster processes without stopping resources; update; restart the cluster processes. They would detect that resources are started and return to the same state as before stopping. Is something like this possible with pacemaker?

Beekhof: Absolutely. This should be of some help: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html

Andrei: Did not work. It ended up moving the master to another node and leaving the slave on the original node stopped after that.

Beekhof: When you stopped the cluster or when you started it after an upgrade?

Andrei: I have a crm_report if this is not expected ... pacemaker Version: 1.1.11-3ca8c3b
Re: [Pacemaker] CentOS 6 - after update pacemaker floods log with warnings
Beekhof: Someone is calling pacemakerd over and over and over. Don't do that.

On 26 Oct 2014, at 7:35 am, Andrew ni...@seti.kr.ua wrote:

Hi all. After upgrading CentOS to current (Pacemaker 1.1.8-7.el6 to 1.1.10-14.el6_5.3), Pacemaker produces tons of logs - nearly 20 GB per day. What may cause this behavior? Running config:

    node node2.cluster \
        attributes p_mysql_mysql_master_IP=192.168.253.4 \
        attributes p_pgsql-data-status=STREAMING|SYNC
    node node1.cluster \
        attributes p_mysql_mysql_master_IP=192.168.253.5 \
        attributes p_pgsql-data-status=LATEST
    primitive ClusterIP ocf:heartbeat:IPaddr \
        params ip=192.168.253.254 nic=br0 cidr_netmask=24 \
        op monitor interval=2s \
        meta target-role=Started
    primitive mysql_reader_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.253.63 nic=br0 cidr_netmask=24 \
        op monitor interval=10s \
        meta target-role=Started
    primitive mysql_writer_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.253.64 nic=br0 cidr_netmask=24 \
        op monitor interval=10s \
        meta target-role=Started
    primitive p_mysql ocf:percona:mysql \
        params config=/etc/my.cnf pid=/var/lib/mysql/mysqld.pid socket=/var/run/mysqld/mysqld.sock replication_user=***user*** replication_passwd=***passwd*** max_slave_lag=60 evict_outdated_slaves=false binary=/usr/libexec/mysqld test_user=***user*** test_passwd=***password*** enable_creation=true \
        op monitor interval=5s role=Master timeout=30s OCF_CHECK_LEVEL=1 \
        op monitor interval=2s role=Slave timeout=30s OCF_CHECK_LEVEL=1 \
        op start interval=0 timeout=120s \
        op stop interval=0 timeout=120s
    primitive p_nginx ocf:heartbeat:nginx \
        params configfile=/etc/nginx/nginx.conf httpd=/usr/sbin/nginx \
        op start interval=0 timeout=60s on-fail=restart \
        op monitor interval=10s timeout=30s on-fail=restart depth=0 \
        op monitor interval=30s timeout=30s on-fail=restart depth=10 \
        op stop interval=0 timeout=120s
    primitive p_perl-fpm ocf:fresh:daemon \
        params binfile=/usr/local/bin/perl-fpm cmdline_options=-u nginx -g nginx -x 180 -t 16 -d -P /var/run/perl-fpm/perl-fpm.pid pidfile=/var/run/perl-fpm/perl-fpm.pid \
        op start interval=0 timeout=30s \
        op monitor interval=10 timeout=20s depth=0 \
        op stop interval=0 timeout=30s
    primitive p_pgsql ocf:fresh:pgsql \
        params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql pgdata=/var/lib/pgsql/9.1/data/ start_opt=-p 5432 rep_mode=sync node_list=node2.cluster node1.cluster restore_command=cp /var/lib/pgsql/9.1/wal_archive/%f %p primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 keepalives_count=5 password=***passwd*** repuser=***user*** master_ip=192.168.253.32 stop_escalate=0 \
        op start interval=0 timeout=120s on-fail=restart \
        op monitor interval=7s timeout=60s on-fail=restart \
        op monitor interval=2s role=Master timeout=60s on-fail=restart \
        op promote interval=0 timeout=120s on-fail=restart \
        op demote interval=0 timeout=120s on-fail=stop \
        op stop interval=0 timeout=120s on-fail=block \
        op notify interval=0 timeout=90s
    primitive p_radius_ip ocf:heartbeat:IPaddr2 \
        params ip=10.255.0.33 nic=lo cidr_netmask=32 \
        op monitor interval=10s
    primitive p_radiusd ocf:fresh:daemon \
        params binfile=/usr/sbin/radiusd pidfile=/var/run/radiusd/radiusd.pid \
        op start interval=0 timeout=30s \
        op monitor interval=10 timeout=20s depth=0 \
        op stop interval=0 timeout=30s
    primitive p_web_ip ocf:heartbeat:IPaddr2 \
        params ip=10.255.0.32 nic=lo cidr_netmask=32 \
        op monitor interval=10s
    primitive pgsql_reader_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.253.31 nic=br0 cidr_netmask=24 \
        meta resource-stickiness=1 \
        op start interval=0 timeout=60s on-fail=restart \
        op monitor interval=10s timeout=60s on-fail=restart \
        op stop interval=0 timeout=60s on-fail=block
    primitive pgsql_writer_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.253.32 nic=br0 cidr_netmask=24 \
        meta migration-threshold=0 \
        op start interval=0 timeout=60s on-fail=restart \
        op monitor interval=10s timeout=60s on-fail=restart \
        op stop interval=0 timeout=60s on-fail=block
    group gr_http p_nginx p_perl-fpm
    ms ms_MySQL p_mysql \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true globally-unique=false target-role=Started
    ms ms_Postgresql p_pgsql \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
    clone cl_http gr_http \
        meta clone-max=2 clone-node-max=1 target-role=Started
    clone cl_radiusd p_radiusd \
        meta clone-max=2 clone-node-max=1 target-role=Started
    location loc-mysql-no-reader-vip mysql_reader_vip \
        rule $id=loc-mysql-no-reader-vip-rule -inf: readable eq 0 \
        rule $id=loc-mysql-no-reader-vip-rule-0 -inf: not_defined readable
    location mysql_master_location ms_MySQL \
        rule $id=mysql_master_location-rule $role=master 110: #uname eq
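[Editor's note: one way to find out what keeps invoking pacemakerd, per Beekhof's diagnosis above, is an execution watch with auditd - a sketch; it assumes auditd is running and the binary lives at /usr/sbin/pacemakerd.]

    # log every execution of the pacemakerd binary
    auditctl -w /usr/sbin/pacemakerd -p x -k pcmk-exec
    # once the log growth recurs, dump the records in readable form;
    # the ppid= field in each record identifies the repeat caller
    ausearch -k pcmk-exec -i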
Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5
Hi Andrew,

Logs in /var/log/httpd/ are empty, but here's a snippet of /var/log/messages right after I start pacemaker and do a crm status: http://pastebin.com/ivQdyV4u

Seems like the Apache service doesn't come up. This only happens after I run the commands in the guide to configure DRBD.

On Fri, Oct 24, 2014 at 8:29 AM, Andrew Beekhof and...@beekhof.net wrote:

Beekhof: logs?

On 23 Oct 2014, at 1:08 pm, Sihan Goi gois...@gmail.com wrote:

Hi, can anyone help? Really stuck here...

On Mon, Oct 20, 2014 at 9:46 AM, Sihan Goi gois...@gmail.com wrote:

Hi, I'm following the "Clusters from Scratch" guide for Fedora 13, and I've managed to get a 2-node cluster working with Apache. However, once I tried to add DRBD 8.4 to the mix, it stopped working. I've followed the DRBD steps in the guide all the way to "cib commit fs" in Section 7.4, right before "Testing Migration". However, when I do a crm_mon, I get the following failed actions:

    Last updated: Thu Oct 16 17:28:34 2014
    Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
    Stack: cman
    Current DC: node02 - partition with quorum
    Version: 1.1.10-14.el6_5.3-368c726
    2 Nodes configured
    5 Resources configured

    Online: [ node01 node02 ]

    ClusterIP (ocf::heartbeat:IPaddr2): Started node02
    Master/Slave Set: WebDataClone [WebData]
        Masters: [ node02 ]
        Slaves: [ node01 ]
    WebFS (ocf::heartbeat:Filesystem): Started node02

    Failed actions:
        WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed Out, last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
        WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms

Seems like the apache WebSite resource isn't starting up. Apache was working just fine before I configured DRBD. What did I do wrong?

-- 
Goi Sihan
gois...@gmail.com
Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?
On Mon, 27 Oct 2014 11:09:08 +1100, Andrew Beekhof and...@beekhof.net wrote:

Beekhof: When you stopped the cluster or when you started it after an upgrade?

Andrei: When I started it:

    crm_attribute -t crm_config -n is-managed-default -v false
    rcopenais stop     # on both nodes
    rcopenais start    # on both nodes; wait for them to stabilize
    crm_attribute -t crm_config -n is-managed-default -v true

It stopped the running master/slave, moved the master and left the slave stopped.

Beekhof (earlier): I have a crm_report if this is not expected ... pacemaker Version: 1.1.11-3ca8c3b [quoted from Andrei's previous message]
Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?
On 27 Oct 2014, at 2:30 pm, Andrei Borzenkov arvidj...@gmail.com wrote:

Andrei: [...] It stopped the running master/slave, moved the master and left the slave stopped.

Beekhof: What did crm_mon say before you set is-managed-default back to true? Did the resource agent properly detect it as running in the master state? Did the resource agent properly (re)set a preference for being promoted during the initial monitor operation? Pacemaker can do it, but it is dependent on the resources behaving correctly.
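[Editor's note: the promotion preference Beekhof refers to is what a master/slave RA normally publishes from its monitor action via crm_master. A sketch of the usual idiom follows; instance_is_master is a hypothetical helper, not code from the agent in question.]

    # inside the RA's monitor/probe action (shell):
    if instance_is_master; then
        crm_master -l reboot -v 100   # strong preference to stay/become master
    else
        crm_master -l reboot -v 50    # weaker preference while a slave
    fi

If the agent skips this during the initial probe, pacemaker has no basis for re-promoting the old master after a reattach.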
Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5
Beekhof: The pastebin has your answer:

    Oct 27 10:28:44 node02 apache(WebSite)[10515]: ERROR: Syntax error on line 292 of /etc/httpd/conf/httpd.conf: DocumentRoot must be a directory

On 27 Oct 2014, at 1:36 pm, Sihan Goi gois...@gmail.com wrote:

[...] Seems like the Apache service doesn't come up. This only happens after I run the commands in the guide to configure DRBD. [...]
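[Editor's note: that error usually means the DocumentRoot lives on the DRBD-backed filesystem and Apache was started on a node where WebFS wasn't mounted yet. If the guide's constraints were missed, something like the following crm-shell sketch, using the resource names from the crm_mon output above, ties them together.]

    # Apache only where (and after) the filesystem is mounted
    colocation WebSite-with-WebFS inf: WebSite WebFS
    order WebFS-before-WebSite inf: WebFS:start WebSite:start
    # the filesystem only on (and after promotion of) the DRBD master
    colocation WebFS-on-drbd-master inf: WebFS WebDataClone:Master
    order WebData-before-WebFS inf: WebDataClone:promote WebFS:start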
Re: [Pacemaker] communications problems in cluster
Beekhof: I know David has been putting a lot of time into the pacemaker-remote stuff lately; it's quite possible that you're hitting a bug on our side. Is trying out the latest from Git an option? Making rpms is pretty easy, just 'make rpm-dep rpm' should be enough.

On 14 Oct 2014, at 1:31 am, Саша Александров shurr...@gmail.com wrote:

Hi! Most likely related... I have node vm-vmwww with remote-node vmwww. Both are reported online (vmwww:vm-vmwww) and vm-vmwww is reported as 'started on wings1'. However, when I try to clean up the failed action (vmwww_start_0 on wings1 'unknown error' (1): call=100, status=Timed Out), here is what I get in the log:

    Oct 13 18:25:43 wings1 crmd[3844]: warning: qb_ipcs_event_sendv: new_event_notification (3844-18918-16): Broken pipe (32)
    Oct 13 18:25:43 wings1 crmd[3844]: error: do_lrm_invoke: no lrmd connection for remote node vmwww found on cluster node wings1. Can not process request.
    Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
    Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
    Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
    Oct 13 18:25:43 wings1 crmd[3844]: error: send_msg_via_ipc: Unknown Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.

I go to the VM and try to run 'crm_mon':

    Oct 13 18:27:06 vmwww pacemaker_remoted[3798]: error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0
    Oct 13 18:27:06 vmwww pacemaker_remoted[3798]: error: handle_new_connection: Error in connection setup (3798-3868-13): Remote I/O error (121)

    ps aux | grep pace
    root 3798 0.1 0.1 76396 2868 ? S 18:16 0:00 pacemaker_remoted

    netstat -nltp | grep 3121
    tcp 0 0 0.0.0.0:3121 0.0.0.0:* LISTEN 3798/pacemaker_remo

However I can telnet OK:

    [root@wings1 ~]# telnet vmwww 3121
    Trying 192.168.222.89...
    Connected to vmwww.
    Escape character is '^]'.
    ^]
    telnet> quit
    Connection closed.

This is pretty weird...

Best regards,
Alex

2014-10-13 17:47 GMT+04:00 Саша Александров shurr...@gmail.com:

Hi! I was building a cluster with pacemaker+pacemaker-remote (CentOS 6.5, everything from the official repo). While I had several resources, everything was fine. However, when I added more VMs (2 nodes and 10 VMs currently) I started to run into problems (see below). The strange thing is that when I start cman/pacemaker some time later, they seem to work fine for a while.
    Oct 13 17:03:54 wings1 pacemakerd[26440]: notice: pcmk_child_exit: Child process crmd terminated with signal 13 (pid=30010, core=0)
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: qb_ipcs_event_sendv: new_event_notification (26448-30010-6): Bad file descriptor (9)
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:54 wings1 pacemakerd[26440]: notice: pcmk_process_exit: Respawning failed child process: crmd
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:54 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
    Oct 13 17:03:57 wings1 pacemakerd[26440]: notice: pcmk_child_exit: Child process crmd terminated with signal 13 (pid=30603, core=0)
    Oct 13 17:03:57 wings1 lrmd[26448]: warning: qb_ipcs_event_sendv: new_event_notification (26448-30603-6): Bad file descriptor (9)
    Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
    Oct 13 17:03:57 wings1 pacemakerd[26440]: notice: pcmk_process_exit: Respawning failed child process: crmd
    Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
    Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
    Oct 13 17:03:57 wings1 lrmd[26448]: warning: send_client_notify: Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
    Oct 13 17:03:57
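[Editor's note: following Beekhof's suggestion, building test rpms from the latest source is roughly the following - a sketch; the repository URL is the current ClusterLabs location, and the make targets are the ones he names.]

    git clone https://github.com/ClusterLabs/pacemaker.git
    cd pacemaker
    make rpm-dep   # the dependency target Beekhof mentions
    make rpm       # build pacemaker rpms from the checked-out tree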