Re: [Pacemaker] MySQL, Percona replication manager - split brain

2014-10-26 Thread Andrei Borzenkov
On Sat, 25 Oct 2014 23:34:54 +0300
Andrew ni...@seti.kr.ua wrote:

 On 25.10.2014 22:34, Digimer wrote:
  On 25/10/14 03:32 PM, Andrew wrote:
  Hi all.
 
  I use Percona as the RA on a cluster (nothing mission-critical currently -
  just zabbix data); today, after restarting the MySQL resource (crm resource
  restart p_mysql), I got a split brain state - MySQL for some reason
  started first on the ex-slave node, and the ex-master started later (possibly
  I've set too small a shutdown timeout - only 120s, but I'm not sure).
 
  After restarting the resource on both nodes it seemed like mysql replication
  was ok - but then after ~50min it fell into split brain again for an unknown
  reason (no resource restart was noticed).
 
  In 'show replication status' there is an error on a table caused by a
  duplicate unique index entry.
 
  So I have some questions:
  1) What causes the split brain, and how can I avoid it in the future?
 
  Cause:
 
  Logs?
 Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State 
 transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
 cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss 
 of CCM Quorum: Ignore
 Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation 
 monitor found resource p_pgsql:0 active in master mode on node1.cluster
 Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation 
 monitor found resource p_mysql:1 active in master mode on node2.cluster

That seems too late. The real cause is that the resource was reported as
being in master state on both nodes, and that happened earlier.
 
 
  Prevent:
 
  Fencing (aka stonith). This is why fencing is required.
 No node failure. Just daemon was restarted.
 

Split brain == loss of communication. It does not matter whether
communication was lost because the node failed or because the daemon was not
running. There is no way for the surviving node to know *why*
communication was lost.

 
  2) How do I resolve the split brain state? Is it enough just to wait for
  the failure, then restart mysql by hand and clean the row with the duplicate
  index in the slave db, and then run the resource again? Or is there some
  automation for such cases?
 
  How are you sharing data? Can you give us a better understanding of 
  your setup?
 
 Semi-synchronous MySQL replication, if you mean sharing DB log between 
 nodes.
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] MySQL, Percona replication manager - split brain

2014-10-26 Thread Andrew

On 26.10.2014 08:32, Andrei Borzenkov wrote:

On Sat, 25 Oct 2014 23:34:54 +0300
Andrew ni...@seti.kr.ua wrote:


On 25.10.2014 22:34, Digimer wrote:

On 25/10/14 03:32 PM, Andrew wrote:

Hi all.

I use Percona as RA on cluster (nothing mission-critical, currently -
just zabbix data); today after restarting MySQL resource (crm resource
restart p_mysql) I've got a split brain state - MySQL for some reason
started first at ex-slave node, ex-master starts later (possibly I've
set too small timeout to shutdown - only 120s, but I'm not sure).

After restart resource on both nodes it seems like mysql replication was
ok - but then after ~50min it fails in split brain again for unknown
reason (no resource restart was noticed).

In 'show replication status' there is an error in table caused by unique
index dup.

So I have some questions:
1) What causes the split brain, and how can I avoid it in the future?

Cause:

Logs?

Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss
of CCM Quorum: Ignore
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
monitor found resource p_mysql:1 active in master mode on node2.cluster

That seems too late. The real cause is that resource was reported as
being in master state on both nodes and this happened earlier.

These are different resources (pgsql and mysql).


Prevent:

Fencing (aka stonith). This is why fencing is required.

No node failure. Just daemon was restarted.


Split brain == loss of communication. It does not matter whether
communication was lost because the node failed or because the daemon was not
running. There is no way for the surviving node to know *why*
communication was lost.

 So how will stonith help in this case? The daemon will be restarted after 
 its death if it dies during a restart, and stonith will see a live 
 daemon...


 So what is the easiest split-brain solution? Just to stop the daemons and 
 copy all the mysql data from the good node to the bad one?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] split brain - after network recovery - resources can still be migrated

2014-10-26 Thread Vladimir
On Sat, 25 Oct 2014 19:11:02 -0400
Digimer li...@alteeve.ca wrote:

 On 25/10/14 06:35 PM, Vladimir wrote:
  On Sat, 25 Oct 2014 17:30:07 -0400
  Digimer li...@alteeve.ca wrote:
 
  On 25/10/14 05:09 PM, Vladimir wrote:
  Hi,
 
  currently I'm testing a 2 node setup using ubuntu trusty.
 
  # The scenario:
 
  All communication links between the 2 nodes are cut off. This
  results in a split brain situation and both nodes take their
  resources online.
 
  When the communication links get back, I see following behaviour:
 
  On drbd level the split brain is detected and the device is
  disconnected on both nodes because of after-sb-2pri disconnect
  and then it goes to StandAlone ConnectionState.
 
  I'm wondering why pacemaker does not let the resources fail.
  It is still possible to migrate resources between the nodes
  although they're in StandAlone ConnectionState. After a split
  brain that's not what I want.
 
  Is this the expected behaviour? Is it possible to let the
  resources fail after the network recovery to avoid further data
  corruption?
 
  (At the moment I can't use resource or node level fencing in my
  setup.)
 
  Here the main part of my config:
 
  # dpkg -l | awk '$2 ~ /^(pacem|coro|drbd|libqb)/{print $2,$3}'
  corosync 2.3.3-1ubuntu1
  drbd8-utils 2:8.4.4-1ubuntu1
  libqb-dev 0.16.0.real-1ubuntu3
  libqb0 0.16.0.real-1ubuntu3
  pacemaker 1.1.10+git20130802-1ubuntu2.1
  pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
 
  # pacemaker
  primitive drbd-mysql ocf:linbit:drbd \
  params drbd_resource=mysql \
  op monitor interval=29s role=Master \
  op monitor interval=30s role=Slave
 
  ms ms-drbd-mysql drbd-mysql \
  meta master-max=1 master-node-max=1 clone-max=2
  clone-node-max=1 notify=true
 
  Split-brains are prevented by using reliable fencing (aka stonith).
  You configure stonith in pacemaker (using IPMI/iRMC/iLO/etc,
  switched PDUs, etc). Then you configure DRBD to use the
  crm-fence-peer.sh fence-handler and you set the fencing policy to
  'resource-and-stonith;'.
 
  This way, if all links fail, both nodes block and call a fence. The
  faster one fences (powers off) the slower, and then it begins
  recovery, assured that the peer is not doing the same.
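 
  As a sketch of the DRBD side of that (the resource name mysql and the
  handler paths are the usual defaults, not taken from this thread):
 
  resource mysql {
    disk {
      # block I/O and escalate to node-level fencing via the cluster
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # (volumes, net and startup sections as before)
  }
 
  plus a working stonith device configured in pacemaker, so the fence
  handler has something to escalate to.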
 
  Without stonith/fencing, there is no defined behaviour. You
  will get split-brains and that is that. Consider: both nodes lose
  contact with their peer. Without fencing, both must assume the peer
  is dead and thus take over resources.
 
  It's clear that split brains can occur in such a setup. But I
  would expect pacemaker to stop the drbd resource when the link
  between the cluster nodes is reestablished, instead of continuing to
  run it.
 
 DRBD will refuse to reconnect until it is told which node's data to 
 delete. This is data loss and can not be safely automated.
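 
 (For reference, once an administrator has decided which data set survives,
 the usual manual recovery is roughly the following - the resource name
 mysql is an assumption:
 
   # on the node whose changes are to be discarded:
   drbdadm disconnect mysql
   drbdadm secondary mysql
   drbdadm connect --discard-my-data mysql
   # on the surviving node, if it has also dropped to StandAlone:
   drbdadm connect mysql
 
 after which DRBD resynchronizes from the survivor.)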

Sorry if I described it unclearly, but I don't want pacemaker to do an
automatic split brain recovery. That would not make any sense to me
either. This decision has to be taken by an administrator.

But is it possible to configure pacemaker to do the following?
 
- if there are 2 Nodes which can see and communicate with each other
  AND
- if their disk state is not UpToDate/UpToDate (typically after a split
  brain)
- then let drbd resource fail because something is obviously broken and
  an administrator has to decide how to continue.

  This is why stonith is required in clusters. Even with quorum, you
  can't assume anything about the state of the peer until it is
  fenced, so it would only give you a false sense of security.
 
  Maybe I can use resource level fencing.
 
 You need node-level fencing.

I know node level fencing is more secure. But shouldn't resource level
fencing also work here? e.g.
(http://www.drbd.org/users-guide/s-pacemaker-fencing.html) 

Currently I can't use IPMI, APC switches or a shared storage device
for fencing - at most fencing via ssh. But from what I've read this is also
not recommended for production setups.

Thanks in advance.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] MySQL, Percona replication manager - split brain

2014-10-26 Thread Andrei Borzenkov
On Sun, 26 Oct 2014 10:51:13 +0200
Andrew ni...@seti.kr.ua wrote:

 On 26.10.2014 08:32, Andrei Borzenkov wrote:
  On Sat, 25 Oct 2014 23:34:54 +0300
  Andrew ni...@seti.kr.ua wrote:
 
  On 25.10.2014 22:34, Digimer wrote:
  On 25/10/14 03:32 PM, Andrew wrote:
  Hi all.
 
  I use Percona as RA on cluster (nothing mission-critical, currently -
  just zabbix data); today after restarting MySQL resource (crm resource
  restart p_mysql) I've got a split brain state - MySQL for some reason
  started first at ex-slave node, ex-master starts later (possibly I've
  set too small timeout to shutdown - only 120s, but I'm not sure).
 

Your logs do not show a resource restart - they show a pacemaker restart on
node2.

  After restart resource on both nodes it seems like mysql replication was
  ok - but then after ~50min it fails in split brain again for unknown
  reason (no resource restart was noticed).
 
  In 'show replication status' there is an error in table caused by unique
  index dup.
 
  So I have some questions:
  1) What causes the split brain, and how can I avoid it in the future?
  Cause:
 
  Logs?
  Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State
  transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
  cause=C_FSA_INTERNAL origin=abort_transition_graph ]
  Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss
  of CCM Quorum: Ignore
  Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
  monitor found resource p_pgsql:0 active in master mode on node1.cluster
  Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
  monitor found resource p_mysql:1 active in master mode on node2.cluster
  That seems too late. The real cause is that resource was reported as
  being in master state on both nodes and this happened earlier.
  These are different resources (pgsql and mysql).
 
  Prevent:
 
  Fencing (aka stonith). This is why fencing is required.
  No node failure. Just daemon was restarted.
 
  Split brain == loss of communication. It does not matter whether
  communication was lost because the node failed or because the daemon was not
  running. There is no way for the surviving node to know *why*
  communication was lost.
 
  So how will stonith help in this case? The daemon will be restarted after
  its death if it dies during a restart, and stonith will see a live
  daemon...
 
 So what is the easiest split-brain solution? Just to stop daemons, and 
 copy all mysql data from good node to bad one?

There is no split-brain visible in your log. Pacemaker on node2 was
restarted, cleanly as far as I can tell, and reintegrated back into the
cluster. Maybe node1 lost node2, but that needs logs from node1.

You are probably misusing 'split brain' in this case. Split-brain means the
nodes lost communication with each other, so each node is unaware of
the state of the resources on the other node. Here 'nodes' means
corosync/pacemaker, not individual resources.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] split brain - after network recovery - resources can still be migrated

2014-10-26 Thread Andrei Borzenkov
On Sun, 26 Oct 2014 12:01:03 +0100
Vladimir m...@foomx.de wrote:

 On Sat, 25 Oct 2014 19:11:02 -0400
 Digimer li...@alteeve.ca wrote:
 
  On 25/10/14 06:35 PM, Vladimir wrote:
   On Sat, 25 Oct 2014 17:30:07 -0400
   Digimer li...@alteeve.ca wrote:
  
   On 25/10/14 05:09 PM, Vladimir wrote:
   Hi,
  
   currently I'm testing a 2 node setup using ubuntu trusty.
  
   # The scenario:
  
    All communication links between the 2 nodes are cut off. This
   results in a split brain situation and both nodes take their
   resources online.
  
   When the communication links get back, I see following behaviour:
  
   On drbd level the split brain is detected and the device is
   disconnected on both nodes because of after-sb-2pri disconnect
   and then it goes to StandAlone ConnectionState.
  
   I'm wondering why pacemaker does not let the resources fail.
   It is still possible to migrate resources between the nodes
   although they're in StandAlone ConnectionState. After a split
   brain that's not what I want.
  
   Is this the expected behaviour? Is it possible to let the
    resources fail after the network recovery to avoid further data
    corruption?
  
   (At the moment I can't use resource or node level fencing in my
   setup.)
  
   Here the main part of my config:
  
   # dpkg -l | awk '$2 ~ /^(pacem|coro|drbd|libqb)/{print $2,$3}'
   corosync 2.3.3-1ubuntu1
   drbd8-utils 2:8.4.4-1ubuntu1
   libqb-dev 0.16.0.real-1ubuntu3
   libqb0 0.16.0.real-1ubuntu3
   pacemaker 1.1.10+git20130802-1ubuntu2.1
   pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
  
   # pacemaker
   primitive drbd-mysql ocf:linbit:drbd \
   params drbd_resource=mysql \
   op monitor interval=29s role=Master \
   op monitor interval=30s role=Slave
  
   ms ms-drbd-mysql drbd-mysql \
   meta master-max=1 master-node-max=1 clone-max=2
   clone-node-max=1 notify=true
  
   Split-brains are prevented by using reliable fencing (aka stonith).
   You configure stonith in pacemaker (using IPMI/iRMC/iLO/etc,
   switched PDUs, etc). Then you configure DRBD to use the
   crm-fence-peer.sh fence-handler and you set the fencing policy to
   'resource-and-stonith;'.
  
   This way, if all links fail, both nodes block and call a fence. The
   faster one fences (powers off) the slower, and then it begins
   recovery, assured that the peer is not doing the same.
  
   Without stonith/fencing, then there is no defined behaviour. You
   will get split-brains and that is that. Consider; Both nodes lose
   contact with their peer. Without fencing, both must assume the peer
   is dead and thus take over resources.
  
   That split brains can occur in such a setup that's clear. But I
   would expect pacemaker to stop the drbd resource when the link
   between the cluster nodes is reestablished instead of continue
   running it.
  
  DRBD will refuse to reconnect until it is told which node's data to 
  delete. This is data loss and can not be safely automated.
 
 Sorry if maybe described it unclear but I don't want pacemaker to do an
 automatic split brain recovery. That would not make any sense to me
  either. This decision has to be taken by an administrator.
 
 But is it possible to configure pacemaker to do the following?
  
 - if there are 2 Nodes which can see and communicate with each other
   AND
 - if their disk state is not UpToDate/UpToDate (typically after a split
   brain)
 - then let drbd resource fail because something is obviously broken and
   an administrator has to decide how to continue.
 

This would require resource agent support. But it looks like the current
resource agent relies on fencing to resolve the split-brain situation. As
long as the resource agent itself does not indicate a resource failure,
there is nothing pacemaker can do.


   This is why stonith is required in clusters. Even with quorum, you
   can't assume anything about the state of the peer until it is
   fenced, so it would only give you a false sense of security.
  
   Maybe I can use resource level fencing.
  
  You need node-level fencing.
 
 I know node level fencing is more secure. But shouldn't resource level
 fencing also work here? e.g.
 (http://www.drbd.org/users-guide/s-pacemaker-fencing.html) 
 
 Currently I can't use ipmi, apc switches or a shared storage device
 for fencing at most fencing via ssh. But what I've read this is also not
 recommended for production setups.
 

You could try the meatware stonith agent. It does exactly what you want -
it freezes further processing until the administrator manually declares one
node as down.
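
A minimal sketch of that, assuming the cluster-glue stonith plugins are
installed; the node names below are placeholders:

  primitive st-meat stonith:meatware \
          params hostlist="node1 node2" \
          op monitor interval="3600s"
  property stonith-enabled="true"

When a fence is required the cluster blocks and logs an operator alert;
after you have verified by hand that the node is really down, you confirm
the fence with:

  meatclient -c node1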

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] split brain - after network recovery - resources can still be migrated

2014-10-26 Thread Digimer

On 26/10/14 12:32 PM, Andrei Borzenkov wrote:

On Sun, 26 Oct 2014 12:01:03 +0100
Vladimir m...@foomx.de wrote:


On Sat, 25 Oct 2014 19:11:02 -0400
Digimer li...@alteeve.ca wrote:


On 25/10/14 06:35 PM, Vladimir wrote:

On Sat, 25 Oct 2014 17:30:07 -0400
Digimer li...@alteeve.ca wrote:


On 25/10/14 05:09 PM, Vladimir wrote:

Hi,

currently I'm testing a 2 node setup using ubuntu trusty.

# The scenario:

All communication links between the 2 nodes are cut off. This
results in a split brain situation and both nodes take their
resources online.

When the communication links get back, I see following behaviour:

On drbd level the split brain is detected and the device is
disconnected on both nodes because of after-sb-2pri disconnect
and then it goes to StandAlone ConnectionState.

I'm wondering why pacemaker does not let the resources fail.
It is still possible to migrate resources between the nodes
although they're in StandAlone ConnectionState. After a split
brain that's not what I want.

Is this the expected behaviour? Is it possible to let the
resources fail after the network recovery to avoid further data
corruption?

(At the moment I can't use resource or node level fencing in my
setup.)

Here the main part of my config:

# dpkg -l | awk '$2 ~ /^(pacem|coro|drbd|libqb)/{print $2,$3}'
corosync 2.3.3-1ubuntu1
drbd8-utils 2:8.4.4-1ubuntu1
libqb-dev 0.16.0.real-1ubuntu3
libqb0 0.16.0.real-1ubuntu3
pacemaker 1.1.10+git20130802-1ubuntu2.1
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1

# pacemaker
primitive drbd-mysql ocf:linbit:drbd \
params drbd_resource=mysql \
op monitor interval=29s role=Master \
op monitor interval=30s role=Slave

ms ms-drbd-mysql drbd-mysql \
meta master-max=1 master-node-max=1 clone-max=2
clone-node-max=1 notify=true


Split-brains are prevented by using reliable fencing (aka stonith).
You configure stonith in pacemaker (using IPMI/iRMC/iLO/etc,
switched PDUs, etc). Then you configure DRBD to use the
crm-fence-peer.sh fence-handler and you set the fencing policy to
'resource-and-stonith;'.

This way, if all links fail, both nodes block and call a fence. The
faster one fences (powers off) the slower, and then it begins
recovery, assured that the peer is not doing the same.

Without stonith/fencing, then there is no defined behaviour. You
will get split-brains and that is that. Consider; Both nodes lose
contact with their peer. Without fencing, both must assume the peer
is dead and thus take over resources.


That split brains can occur in such a setup that's clear. But I
would expect pacemaker to stop the drbd resource when the link
between the cluster nodes is reestablished instead of continue
running it.


DRBD will refuse to reconnect until it is told which node's data to
delete. This is data loss and can not be safely automated.


Sorry if maybe described it unclear but I don't want pacemaker to do an
automatic split brain recovery. That would not make any sense to me
either. This decision has to be taken by an administrator.

But is it possible to configure pacemaker to do the following?

- if there are 2 Nodes which can see and communicate with each other
   AND
- if their disk state is not UpToDate/UpToDate (typically after a split
   brain)
- then let drbd resource fail because something is obviously broken and
   an administrator has to decide how to continue.



This would require resource agent support. But it looks like current
resource agent relies on fencing to resolve split brain situation. As
long as resource agent itself does not indicate resource failure,
there is nothing pacemaker can do.



This is why stonith is required in clusters. Even with quorum, you
can't assume anything about the state of the peer until it is
fenced, so it would only give you a false sense of security.


Maybe I can use resource level fencing.


You need node-level fencing.


I know node level fencing is more secure. But shouldn't resource level
fencing also work here? e.g.
(http://www.drbd.org/users-guide/s-pacemaker-fencing.html)

Currently I can't use ipmi, apc switches or a shared storage device
for fencing at most fencing via ssh. But what I've read this is also not
recommended for production setups.



You could try meatware stonith agent. This does exactly what you want -
it freezes further processing unless administrator manually declares one
node as down.


Support for this was dropped in the Red Hat world back in 2010ish 
because it was so easy for a panicky admin to clear the fence without 
adequately ensuring the peer node had actually been turned off. I 
*strongly* recommend against using manual fencing. If your nodes have 
IPMI, use that. It's super well tested.
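
A sketch of what that can look like in crm syntax - addresses, credentials
and the exact agent name are placeholders that depend on the hardware:

  primitive st-node1 stonith:fence_ipmilan \
          params pcmk_host_list="node1" ipaddr="10.0.0.1" \
                 login="admin" passwd="secret" lanplus="true" \
          op monitor interval="60s"
  primitive st-node2 stonith:fence_ipmilan \
          params pcmk_host_list="node2" ipaddr="10.0.0.2" \
                 login="admin" passwd="secret" lanplus="true" \
          op monitor interval="60s"
  # keep each fence device off the node it is meant to kill
  location l-st-node1 st-node1 -inf: node1
  location l-st-node2 st-node2 -inf: node2
  property stonith-enabled="true"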


--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] MySQL, Percona replication manager - split brain

2014-10-26 Thread Andrew

On 26.10.2014 17:44, Andrei Borzenkov wrote:

On Sun, 26 Oct 2014 10:51:13 +0200
Andrew ni...@seti.kr.ua wrote:


On 26.10.2014 08:32, Andrei Borzenkov wrote:

On Sat, 25 Oct 2014 23:34:54 +0300
Andrew ni...@seti.kr.ua wrote:


On 25.10.2014 22:34, Digimer wrote:

On 25/10/14 03:32 PM, Andrew wrote:

Hi all.

I use Percona as RA on cluster (nothing mission-critical, currently -
just zabbix data); today after restarting MySQL resource (crm resource
restart p_mysql) I've got a split brain state - MySQL for some reason
started first at ex-slave node, ex-master starts later (possibly I've
set too small timeout to shutdown - only 120s, but I'm not sure).


Your logs do not show resource restart - they show pacemaker restart on
node2.

Yes, you're right. This was a pacemaker restart.



After restart resource on both nodes it seems like mysql replication was
ok - but then after ~50min it fails in split brain again for unknown
reason (no resource restart was noticed).

In 'show replication status' there is an error in table caused by unique
index dup.

So I have some questions:
1) What causes the split brain, and how can I avoid it in the future?

Cause:

Logs?

Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss
of CCM Quorum: Ignore
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
monitor found resource p_pgsql:0 active in master mode on node1.cluster
Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
monitor found resource p_mysql:1 active in master mode on node2.cluster

That seems too late. The real cause is that resource was reported as
being in master state on both nodes and this happened earlier.

These are different resources (pgsql and mysql).


Prevent:

Fencing (aka stonith). This is why fencing is required.

No node failure. Just daemon was restarted.


Split brain == loss of communication. It does not matter whether
communication was lost because the node failed or because the daemon was not
running. There is no way for the surviving node to know *why*
communication was lost.


So how will stonith help in this case? The daemon will be restarted after
its death if it dies during a restart, and stonith will see a live
daemon...

So what is the easiest split-brain solution? Just to stop daemons, and
copy all mysql data from good node to bad one?

There is no split-brain visible in your log. Pacemaker on node2 was
restarted, cleanly as far as I can tell, and reintegrated back in
cluster. May be node1 lost node2, but that needs logs from node1.


Here is the log from the other node:

Oct 25 13:54:13 node1 pacemakerd[21773]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:13 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (56)
Oct 25 13:54:13 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6993: master-p_mysql=56
Oct 25 13:54:16 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (53)
Oct 25 13:54:16 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6995: master-p_mysql=53
Oct 25 13:54:16 node1 pacemakerd[22035]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:18 node1 attrd[26079]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p_mysql (60)
Oct 25 13:54:18 node1 attrd[26079]:   notice: attrd_perform_update: Sent 
update 6997: master-p_mysql=60
Oct 25 13:54:18 node1 pacemakerd[22335]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:19 node1 pacemakerd[22476]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:19 node1 mysql(p_mysql)[22446]: INFO: Ignoring post-demote 
notification execpt for my own demotion.
Oct 25 13:54:19 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2423, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:19 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_pgsql_notify_0 (call=2425, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:20 node1 pacemakerd[22597]:   notice: crm_add_logfile: 
Additional logging available in /var/log/cluster/corosync.log
Oct 25 13:54:20 node1 mysql(p_mysql)[22540]: INFO: Ignoring post-demote 
notification execpt for my own demotion.
Oct 25 13:54:20 node1 crmd[26081]:   notice: process_lrm_event: LRM 
operation p_mysql_notify_0 (call=2433, rc=0, cib-update=0, 
confirmed=true) ok
Oct 25 13:54:20 node1 IPaddr(ClusterIP)[22538]: INFO: Adding inet 
address 192.168.253.254/24 with broadcast address 192.168.253.255 to 
device br0
Oct 25 13:54:20 node1 IPaddr2(pgsql_reader_vip)[22539]: INFO: Adding 
inet address 192.168.253.31/24 

Re: [Pacemaker] MySQL, Percona replication manager - split brain

2014-10-26 Thread Andrew Beekhof
Please send log files as attachments.  The line wrapping makes them very hard 
to read inline.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?

2014-10-26 Thread Andrew Beekhof

 On 25 Oct 2014, at 12:38 am, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 On Fri, Oct 24, 2014 at 9:17 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 16 Oct 2014, at 9:31 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 The primary goal is to transparently update software in the cluster. I
 just did an HA suite update using plain RPM and observed that RPM
 attempts to restart the stack (rcopenais try-restart). So
 
 a) if it worked, it would mean resources had been migrated from this
 node - interruption
 
 b) it did not work - apparently new versions of the installed utils were
 incompatible with the running pacemaker, so the request to shut down crm
 failed and openais hung forever.
 
 The usual workflow with other cluster products I worked with before was -
 stop cluster processes without stopping resources; update; restart
 cluster processes. They would detect that resources are started and
 return to the same state as before stopping. Is something like this
 possible with pacemaker?
 
 absolutely.  this should be of some help:
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html
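  
  (In outline, and as a sketch only - whether it matches the linked document
  exactly is not guaranteed - the procedure is:
  
    # tell pacemaker to stop managing resources, cluster-wide
    crm_attribute --name maintenance-mode --update true
    # stop the cluster stack on the node, upgrade, start it again
    # check with crm_mon that everything was re-detected as running, then
    crm_attribute --name maintenance-mode --delete
  )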
 
 
 It did not work. It ended up moving the master to another node and leaving
 the slave on the original node stopped after that.

When you stopped the cluster or when you started it after an upgrade?

 I have crm_report if this
 is not expected ...
 
 pacemaker Version: 1.1.11-3ca8c3b
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CentOS 6 - after update pacemaker floods log with warnings

2014-10-26 Thread Andrew Beekhof
Someone is calling pacemakerd over and over and over.  Don't do that.
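
One way to find the caller, as a sketch (assumes auditd is available and the
stock binary path):

  # watch executions of the pacemakerd binary
  auditctl -w /usr/sbin/pacemakerd -p x -k pacemakerd-exec
  # a little later, show who executed it (the records include the caller)
  ausearch -k pacemakerd-exec -i

Something repeatedly re-running the cluster init script (a monitoring agent
or a cron job, for example) is a common culprit.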

 On 26 Oct 2014, at 7:35 am, Andrew ni...@seti.kr.ua wrote:
 
 Hi all.
 After upgrading CentOS to current (Pacemaker 1.1.8-7.el6 to 1.1.10-14.el6_5.3), 
 Pacemaker produces tons of logs - nearly 20GB per day. What may cause this 
 behavior?
 
 Running config:
 node node2.cluster \
attributes p_mysql_mysql_master_IP=192.168.253.4 \
attributes p_pgsql-data-status=STREAMING|SYNC
 node node1.cluster \
attributes p_mysql_mysql_master_IP=192.168.253.5 \
attributes p_pgsql-data-status=LATEST
 primitive ClusterIP ocf:heartbeat:IPaddr \
params ip=192.168.253.254 nic=br0 cidr_netmask=24 \
op monitor interval=2s \
meta target-role=Started
 primitive mysql_reader_vip ocf:heartbeat:IPaddr2 \
params ip=192.168.253.63 nic=br0 cidr_netmask=24 \
op monitor interval=10s \
meta target-role=Started
 primitive mysql_writer_vip ocf:heartbeat:IPaddr2 \
params ip=192.168.253.64 nic=br0 cidr_netmask=24 \
op monitor interval=10s \
meta target-role=Started
 primitive p_mysql ocf:percona:mysql \
params config=/etc/my.cnf pid=/var/lib/mysql/mysqld.pid 
 socket=/var/run/mysqld/mysqld.sock replication_user=***user*** 
 replication_passwd=***passwd*** max_slave_lag=60 
 evict_outdated_slaves=false binary=/usr/libexec/mysqld 
 test_user=***user*** test_passwd=***password*** enable_creation=true \
op monitor interval=5s role=Master timeout=30s OCF_CHECK_LEVEL=1 \
op monitor interval=2s role=Slave timeout=30s OCF_CHECK_LEVEL=1 \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s
 primitive p_nginx ocf:heartbeat:nginx \
params configfile=/etc/nginx/nginx.conf httpd=/usr/sbin/nginx \
op start interval=0 timeout=60s on-fail=restart \
op monitor interval=10s timeout=30s on-fail=restart depth=0 \
op monitor interval=30s timeout=30s on-fail=restart depth=10 \
op stop interval=0 timeout=120s
 primitive p_perl-fpm ocf:fresh:daemon \
params binfile=/usr/local/bin/perl-fpm cmdline_options=-u nginx -g 
 nginx -x 180 -t 16 -d -P /var/run/perl-fpm/perl-fpm.pid 
 pidfile=/var/run/perl-fpm/perl-fpm.pid \
op start interval=0 timeout=30s \
op monitor interval=10 timeout=20s depth=0 \
op stop interval=0 timeout=30s
 primitive p_pgsql ocf:fresh:pgsql \
params pgctl=/usr/pgsql-9.1/bin/pg_ctl psql=/usr/pgsql-9.1/bin/psql 
 pgdata=/var/lib/pgsql/9.1/data/ start_opt=-p 5432 rep_mode=sync 
 node_list=node2.cluster node1.cluster restore_command=cp 
 /var/lib/pgsql/9.1/wal_archive/%f %p 
 primary_conninfo_opt=keepalives_idle=60 keepalives_interval=5 
 keepalives_count=5 password=***passwd*** repuser=***user*** 
 master_ip=192.168.253.32 stop_escalate=0 \
op start interval=0 timeout=120s on-fail=restart \
op monitor interval=7s timeout=60s on-fail=restart \
op monitor interval=2s role=Master timeout=60s on-fail=restart \
op promote interval=0 timeout=120s on-fail=restart \
op demote interval=0 timeout=120s on-fail=stop \
op stop interval=0 timeout=120s on-fail=block \
op notify interval=0 timeout=90s
 primitive p_radius_ip ocf:heartbeat:IPaddr2 \
params ip=10.255.0.33 nic=lo cidr_netmask=32 \
op monitor interval=10s
 primitive p_radiusd ocf:fresh:daemon \
params binfile=/usr/sbin/radiusd pidfile=/var/run/radiusd/radiusd.pid \
op start interval=0 timeout=30s \
op monitor interval=10 timeout=20s depth=0 \
op stop interval=0 timeout=30s
 primitive p_web_ip ocf:heartbeat:IPaddr2 \
params ip=10.255.0.32 nic=lo cidr_netmask=32 \
op monitor interval=10s
 primitive pgsql_reader_vip ocf:heartbeat:IPaddr2 \
params ip=192.168.253.31 nic=br0 cidr_netmask=24 \
meta resource-stickiness=1 \
op start interval=0 timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0 timeout=60s on-fail=block
 primitive pgsql_writer_vip ocf:heartbeat:IPaddr2 \
params ip=192.168.253.32 nic=br0 cidr_netmask=24 \
meta migration-threshold=0 \
op start interval=0 timeout=60s on-fail=restart \
op monitor interval=10s timeout=60s on-fail=restart \
op stop interval=0 timeout=60s on-fail=block
 group gr_http p_nginx p_perl-fpm
 ms ms_MySQL p_mysql \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 
 notify=true globally-unique=false target-role=Started
 ms ms_Postgresql p_pgsql \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 
 notify=true target-role=Started
 clone cl_http gr_http \
meta clone-max=2 clone-node-max=1 target-role=Started
 clone cl_radiusd p_radiusd \
meta clone-max=2 clone-node-max=1 target-role=Started
 location loc-mysql-no-reader-vip mysql_reader_vip \
rule $id=loc-mysql-no-reader-vip-rule -inf: readable eq 0 \
rule $id=loc-mysql-no-reader-vip-rule-0 -inf: not_defined readable
 location mysql_master_location ms_MySQL \
rule $id=mysql_master_location-rule $role=master 110: #uname eq 
 

Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5

2014-10-26 Thread Sihan Goi
Hi Andrew,

Logs in /var/log/httpd/ are empty, but here's a snippet of
/var/log/messages right after I start pacemaker and do a crm status

http://pastebin.com/ivQdyV4u

Seems like the Apache service doesn't come up. This only happens after I
run the commands in the guide to configure DRBD.

On Fri, Oct 24, 2014 at 8:29 AM, Andrew Beekhof and...@beekhof.net wrote:

 logs?

  On 23 Oct 2014, at 1:08 pm, Sihan Goi gois...@gmail.com wrote:
 
  Hi, can anyone help? Really stuck here...
 
  On Mon, Oct 20, 2014 at 9:46 AM, Sihan Goi gois...@gmail.com wrote:
  Hi,
 
  I'm following the Clusters from Scratch guide for Fedora 13, and I've
 managed to get a 2 node cluster working with Apache. However, once I tried
 to add DRBD 8.4 to the mix, it stopped working.
 
  I've followed the DRBD steps in the guide all the way till cib commit
 fs in Section 7.4, right before Testing Migration. However, when I do a
 crm_mon, I get the following failed actions.
 
  Last updated: Thu Oct 16 17:28:34 2014
  Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
  Stack: cman
  Current DC: node02 - partition with quorum
  Version: 1.1.10-14.el6_5.3-368c726
  2 Nodes configured
  5 Resources configured
 
 
  Online: [ node01 node02 ]
 
  ClusterIP(ocf::heartbeat:IPaddr2):Started node02
   Master/Slave Set: WebDataClone [WebData]
   Masters: [ node02 ]
   Slaves: [ node01 ]
  WebFS   (ocf::heartbeat:Filesystem):Started node02
 
  Failed actions:
  WebSite_start_0 on node02 'unknown error' (1): call=278,
 status=Timed Out, last-rc-change='Thu Oct 16 17:26:28 2014',
 queued=2ms, exec=0ms
  WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed
  Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms
 
  Seems like the apache Website resource isn't starting up. Apache was
  working just fine before I configured DRBD. What did I do wrong?
 
  --
  - Goi Sihan
  gois...@gmail.com
 
 
 
  --
  - Goi Sihan
  gois...@gmail.com
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




-- 
- Goi Sihan
gois...@gmail.com
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?

2014-10-26 Thread Andrei Borzenkov
On Mon, 27 Oct 2014 11:09:08 +1100
Andrew Beekhof and...@beekhof.net wrote:

 
  On 25 Oct 2014, at 12:38 am, Andrei Borzenkov arvidj...@gmail.com wrote:
  
  On Fri, Oct 24, 2014 at 9:17 AM, Andrew Beekhof and...@beekhof.net wrote:
  
  On 16 Oct 2014, at 9:31 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
  
  The primary goal is to transparently update software in cluster. I
  just did HA suite update using simple RPM and observed that RPM
  attempts to restart stack (rcopenais try-restart). So
  
  a) if it worked, it would mean resources had been migrated from this
  node - interruption
  
  b) it did not work - apparently new versions of installed utils were
  incompatible with running pacemaker so request to shutdown crm fails
  and openais hung forever.
  
  The usual workflow with one cluster products I worked before was -
  stop cluster processes without stopping resources; update; restart
  cluster processes. They would detect that resources are started and
  return to the same state as before stopping. Is something like this
  possible with pacemaker?
  
  absolutely.  this should be of some help:
  
  http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html
  
  
  Did not work. It ended up moving master to another node and leaving
  slave on original node stopped after that.
 
 When you stopped the cluster or when you started it after an upgrade?

When I started it

crm_attribute -t crm_config -n is-managed-default -v false
rcopenais stop on both nodes
rcopenais start on both node; wait for them to stabilize
crm_attribute -t crm_config -n is-managed-default -v true

It stopped the running master/slave, moved the master, and left the slave stopped.

 
  I have crm_report if this
  is not expected ...
  
  pacemaker Version: 1.1.11-3ca8c3b
  
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
  
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Stopping/restarting pacemaker without stopping resources?

2014-10-26 Thread Andrew Beekhof

 On 27 Oct 2014, at 2:30 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
 
  On Mon, 27 Oct 2014 11:09:08 +1100
  Andrew Beekhof and...@beekhof.net wrote:
 
 
 On 25 Oct 2014, at 12:38 am, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 On Fri, Oct 24, 2014 at 9:17 AM, Andrew Beekhof and...@beekhof.net wrote:
 
 On 16 Oct 2014, at 9:31 pm, Andrei Borzenkov arvidj...@gmail.com wrote:
 
 The primary goal is to transparently update software in cluster. I
 just did HA suite update using simple RPM and observed that RPM
 attempts to restart stack (rcopenais try-restart). So
 
 a) if it worked, it would mean resources had been migrated from this
 node - interruption
 
 b) it did not work - apparently new versions of installed utils were
 incompatible with running pacemaker so request to shutdown crm fails
 and openais hung forever.
 
 The usual workflow with one cluster products I worked before was -
 stop cluster processes without stopping resources; update; restart
 cluster processes. They would detect that resources are started and
 return to the same state as before stopping. Is something like this
 possible with pacemaker?
 
 absolutely.  this should be of some help:
 
 http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_disconnect_and_reattach.html
 
 
 Did not work. It ended up moving master to another node and leaving
 slave on original node stopped after that.
 
 When you stopped the cluster or when you started it after an upgrade?
 
 When I started it
 
 crm_attribute -t crm_config -n is-managed-default -v false
 rcopenais stop on both nodes
 rcopenais start on both node; wait for them to stabilize
 crm_attribute -t crm_config -n is-managed-default -v true
 
 It stopped running master/slave, moved master and left slave stopped.

What did crm_mon say before you set is-managed-default back to true?
Did the resource agent properly detect it as running in the master state?
Did the resource agent properly (re)set a preference for being promoted during 
the initial monitor operation?

Pacemaker can do it, but it is dependent on the resources behaving correctly.
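
One way to check that, as a sketch, while the cluster is still unmanaged:

  # one-shot status including node attributes, so the master-* promotion
  # scores set by the agent's monitor are visible
  crm_mon -A1
  # ask the policy engine what it would do with the live CIB, showing scores
  crm_simulate -sL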

 
 
 I have crm_report if this
 is not expected ...
 
 pacemaker Version: 1.1.11-3ca8c3b
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] DRBD with Pacemaker on CentOs 6.5

2014-10-26 Thread Andrew Beekhof
Oct 27 10:28:44 node02 apache(WebSite)[10515]: ERROR: Syntax error on line 292 
of /etc/httpd/conf/httpd.conf: DocumentRoot must be a directory
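
That error usually means the DocumentRoot lives on the DRBD-backed
filesystem, so apache can only start after WebFS is mounted on the same
node. A sketch of the constraints that enforce that, using the resource
names from the status output below (whether the existing configuration
already contains them is not known):

  colocation website-with-fs inf: WebSite WebFS
  order website-after-fs inf: WebFS WebSite
  colocation fs-on-drbd-master inf: WebFS WebDataClone:Master
  order fs-after-drbd inf: WebDataClone:promote WebFS:start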



 On 27 Oct 2014, at 1:36 pm, Sihan Goi gois...@gmail.com wrote:
 
 Hi Andrew,
 
 Logs in /var/log/httpd/ are empty, but here's a snippet of /var/log/messages 
 right after I start pacemaker and do a crm status
 
 http://pastebin.com/ivQdyV4u
 
 Seems like the Apache service doesn't come up. This only happens after I run 
 the commands in the guide to configure DRBD.
 
 On Fri, Oct 24, 2014 at 8:29 AM, Andrew Beekhof and...@beekhof.net wrote:
 logs?
 
  On 23 Oct 2014, at 1:08 pm, Sihan Goi gois...@gmail.com wrote:
 
  Hi, can anyone help? Really stuck here...
 
  On Mon, Oct 20, 2014 at 9:46 AM, Sihan Goi gois...@gmail.com wrote:
  Hi,
 
  I'm following the Clusters from Scratch guide for Fedora 13, and I've 
  managed to get a 2 node cluster working with Apache. However, once I tried 
  to add DRBD 8.4 to the mix, it stopped working.
 
  I've followed the DRBD steps in the guide all the way till cib commit fs 
  in Section 7.4, right before Testing Migration. However, when I do a 
  crm_mon, I get the following failed actions.
 
  Last updated: Thu Oct 16 17:28:34 2014
  Last change: Thu Oct 16 17:26:04 2014 via crm_shadow on node01
  Stack: cman
  Current DC: node02 - partition with quorum
  Version: 1.1.10-14.el6_5.3-368c726
  2 Nodes configured
  5 Resources configured
 
 
  Online: [ node01 node02 ]
 
  ClusterIP(ocf::heartbeat:IPaddr2):Started node02
   Master/Slave Set: WebDataClone [WebData]
   Masters: [ node02 ]
   Slaves: [ node01 ]
  WebFS   (ocf::heartbeat:Filesystem):Started node02
 
  Failed actions:
  WebSite_start_0 on node02 'unknown error' (1): call=278, status=Timed 
  Out, last-rc-change='Thu Oct 16 17:26:28 2014', queued=2ms, exec=0ms
  WebSite_start_0 on node01 'unknown error' (1): call=203, status=Timed
  Out, last-rc-change='Thu Oct 16 17:26:09 2014', queued=2ms, exec=0ms
 
  Seems like the apache Website resource isn't starting up. Apache was
  working just fine before I configured DRBD. What did I do wrong?
 
  --
  - Goi Sihan
  gois...@gmail.com
 
 
 
  --
  - Goi Sihan
  gois...@gmail.com
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 -- 
 - Goi Sihan
 gois...@gmail.com
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] communications problems in cluster

2014-10-26 Thread Andrew Beekhof
I know David has been putting a lot of time into the pacemaker-remote stuff 
lately, it's quite possible that you're hitting a bug on our side.
Is trying out the latest from Git an option?

Making rpms is pretty easy, just 'make rpm-dep rpm' should be enough.


 On 14 Oct 2014, at 1:31 am, Саша Александров shurr...@gmail.com wrote:
 
 Hi!
 
 Most likely related...
 I have node vm-vmwww with remote-node vmwww. Both are reported online 
 (vmwww:vm-vmwww) and vm-vmwww is reported as 'started on wings1'.
 However, when I try to clean up the failed action vmwww_start_0 on wings1 
 ('unknown error' (1): call=100, status=Timed Out), here is what I get in the 
 log:
 
 Oct 13 18:25:43 wings1 crmd[3844]:  warning: qb_ipcs_event_sendv: 
 new_event_notification (3844-18918-16): Broken pipe (32)
 Oct 13 18:25:43 wings1 crmd[3844]:error: do_lrm_invoke: no lrmd 
 connection for remote node vmwww found on cluster node wings1. Can not 
 process request.
 Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown 
 Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
 Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown 
 Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
 Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown 
 Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
 Oct 13 18:25:43 wings1 crmd[3844]:error: send_msg_via_ipc: Unknown 
 Sub-system (d483a600-5535-4f0d-8ffd-2af391f5cb21)... discarding message.
 
 I go to the VM, and try to run 'crm_mon':
 
 Oct 13 18:27:06 vmwww pacemaker_remoted[3798]:error: ipc_proxy_accept: No 
 ipc providers available for uid 0 gid 0
 Oct 13 18:27:06 vmwww pacemaker_remoted[3798]:error: 
 handle_new_connection: Error in connection setup (3798-3868-13): Remote I/O 
 error (121)
 
 ps aux | grep pace
 root  3798  0.1  0.1  76396  2868 ?S18:16   0:00 
 pacemaker_remoted
 
 netstat -nltp | grep 3121
 tcp0  0 0.0.0.0:31210.0.0.0:*   
 LISTEN  3798/pacemaker_remo
 
 However I can telnet ok:
 
 [root@wings1 ~]# telnet vmwww 3121
 Trying 192.168.222.89...
 Connected to vmwww.
 Escape character is '^]'.
 ^]
 telnet> quit
 Connection closed.
 
 This is pretty weird...
 
 Best regards,
 Alex
 
 
 2014-10-13 17:47 GMT+04:00 Саша Александров shurr...@gmail.com:
 Hi!
 
 I was building a cluster with pacemaker+pacemaker-remote  (CentOS 6.5, 
 everything from the official repo).
 While I had several resources, everything was fine. However, when I added 
 more VMs (2 nodes and 10 VMs currently) I started to run into problems (see 
 below).
 The strange thing is that when I start cman/pacemaker again some time later, 
 they seem to work fine for a while.
 
 Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child 
 process crmd terminated with signal 13 (pid=30010, core=0)
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv: 
 new_event_notification (26448-30010-6): Bad file descriptor (9)
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 Oct 13 17:03:54 wings1 pacemakerd[26440]:   notice: pcmk_process_exit: 
 Respawning failed child process: crmd
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 Oct 13 17:03:54 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/665bd130-2630-454b-9102-3f17d2bd71f3 failed
 
 Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_child_exit: Child 
 process crmd terminated with signal 13 (pid=30603, core=0)
 Oct 13 17:03:57 wings1 lrmd[26448]:  warning: qb_ipcs_event_sendv: 
 new_event_notification (26448-30603-6): Bad file descriptor (9)
 Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
 Oct 13 17:03:57 wings1 pacemakerd[26440]:   notice: pcmk_process_exit: 
 Respawning failed child process: crmd
 Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
 Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
 Oct 13 17:03:57 wings1 lrmd[26448]:  warning: send_client_notify: 
 Notification of client crmd/820ac884-24ca-4fff-9dc8-0a09e82e0e0a failed
 Oct 13 17:03:57