Re: [Pacemaker] Y should pacemaker be started simultaneously.

2014-10-17 Thread Andrei Borzenkov
On Mon, 06 Oct 2014 10:27:49 -0400,
Digimer li...@alteeve.ca wrote:

 On 06/10/14 02:11 AM, Andrei Borzenkov wrote:
  On Mon, Oct 6, 2014 at 9:03 AM, Digimer li...@alteeve.ca wrote:
  If stonith was configured, after the timeout, the first node would fence
  the second node (unable to reach != off).
 
  Alternatively, you can set corosync to 'wait_for_all' and have the first
  node do nothing until it sees the peer.
 
 
  Am I right that wait_for_all is available only in corosync 2.x and not in 
  1.x?
 
 You are correct, yes.
 
  To do otherwise would be to risk a split-brain. Each node needs to know the
  state of the peer in order to run services safely. By having both start at
  the same time, each knows what the other is doing. By disabling quorum,
  you allow one node to continue to operate when the other leaves, but it
  needs that initial connection to know for sure what it's doing.
 
 
  Does it apply to both corosync 1.x and 2.x, or only to 2.x with
  wait_for_all? I was actually also confused about the precise
  meaning of disabling quorum in pacemaker (setting no-quorum-policy:
  ignore). So if I have a two-node cluster with pacemaker 1.x and corosync
  1.x, with no-quorum-policy=ignore and no fencing, what happens when
  a single node starts?
 
 Quorum tells the cluster that if a peer leaves (gracefully or was 
 fenced), the remaining node is allowed to continue providing services.
 
 Stonith is needed to put a node that is in an unknown state into a known 
 state, be it because the node couldn't be reached when starting or because 
 it stopped responding.
 
 So quorum and stonith play rather different roles.
 
 Without stonith, regardless of quorum, you risk split-brain and/or data 
 corruption. Operating a cluster without stonith is operating it in an 
 undetermined state, and should never be done.
 

OK, let me try to rephrase. Is it possible to achieve the same effect as
wait_for_all in corosync 2.x with a combination of pacemaker 1.1.x and
corosync 1.x? I.e., to ensure that the cluster does not come up *on the
first startup* until all nodes are present, so that cluster nodes simply
wait for the others to join instead of trying to stonith them?

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Y should pacemaker be started simultaneously.

2014-10-17 Thread Digimer

On 18/10/14 12:18 AM, Andrei Borzenkov wrote:

[quoted text snipped]
OK, let me try to rephrase. Is it possible to achieve the same effect as
wait_for_all in corosync 2.x with a combination of pacemaker 1.1.x and
corosync 1.x? I.e., to ensure that the cluster does not come up *on the
first startup* until all nodes are present, so that cluster nodes simply
wait for the others to join instead of trying to stonith them?


No, not that I know of. To achieve the same behaviour, I wrote my own 
program[1] to do this. It is called on boot and waits for the peer to 
become reachable, then it starts the cluster stack. So the same effect 
is gained, but it's done outside of corosync itself.

Note that I wrote it for corosync 1.x + cman + rgmanager, but the 
concepts port trivially.


digimer

1. https://github.com/digimer/an-cdb/blob/master/tools/safe_anvil_start
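As a rough illustration of the approach (this is not the actual safe_anvil_start code; POSIX sh, a ping-based reachability check, and the peer name are all assumptions), such a wrapper boils down to:

```shell
#!/bin/sh
# Return success once a single ping to the peer gets through.
# -c 1: one probe; -W 1: wait at most one second for a reply.
peer_reachable() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# Block until the peer answers, polling every couple of seconds.
wait_for_peer() {
    until peer_reachable "$1"; do
        sleep 2
    done
}

# Hypothetical use on boot, for a cman-era stack:
#   wait_for_peer node2 && service cman start && service pacemaker start
```

Because the wait happens before the cluster stack starts, neither node can time out and fence the other during a cold start; they simply sit until both are reachable.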

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?




Re: [Pacemaker] Y should pacemaker be started simultaneously.

2014-10-06 Thread Digimer

On 06/10/14 02:11 AM, Andrei Borzenkov wrote:

[quoted text snipped]
Does it apply to both corosync 1.x and 2.x, or only to 2.x with
wait_for_all? I was actually also confused about the precise
meaning of disabling quorum in pacemaker (setting no-quorum-policy:
ignore). So if I have a two-node cluster with pacemaker 1.x and corosync
1.x, with no-quorum-policy=ignore and no fencing, what happens when
a single node starts?


Quorum tells the cluster that if a peer leaves (gracefully or was 
fenced), the remaining node is allowed to continue providing services.


Stonith is needed to put a node that is in an unknown state into a known 
state, be it because the node couldn't be reached when starting or because 
it stopped responding.


So quorum and stonith play rather different roles.

Without stonith, regardless of quorum, you risk split-brain and/or data 
corruption. Operating a cluster without stonith is operating it in an 
undetermined state, and should never be done.
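For reference, the two settings under discussion are pacemaker cluster properties. A crm shell configuration fragment of that era would look roughly like this (illustrative values only; ignore plus disabled stonith is exactly the risky combination described above):

```
property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        stonith-enabled="false"
```

On pcs-based systems the same pair can be set with `pcs property set no-quorum-policy=ignore` and `pcs property set stonith-enabled=false`.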






[Pacemaker] Y should pacemaker be started simultaneously.

2014-10-05 Thread N, Ravikiran
Hi all,

I have had this question for a while and did not understand the logic behind
it. Why do I have to start pacemaker on both nodes of my 2-node cluster
simultaneously, even though I have disabled quorum in the cluster?
It fails at this step of the startup:

[root@rk16 ~]# service pacemaker start
Starting cluster:
   Checking if cluster has been disabled at boot...[  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules...   [  OK  ]
   Mounting configfs...[  OK  ]
   Starting cman...[  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
   [FAILED]
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping gfs_controld...[  OK  ]
   Stopping dlm_controld...[  OK  ]
   Stopping fenced...  [  OK  ]
   Stopping cman...[  OK  ]
   Waiting for corosync to shutdown:.  [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs...  [  OK  ]
Starting Pacemaker Cluster Manager:[  OK  ]
[root@rk16 ~]# service pacemaker status
pacemakerd dead but pid file exists
[root@rk16 ~]#

Regards,
Ravikiran N




Re: [Pacemaker] Y should pacemaker be started simultaneously.

2014-10-05 Thread Digimer
If stonith was configured, after the timeout, the first node would 
fence the second node (unable to reach != off).


Alternatively, you can set corosync to 'wait_for_all' and have the first 
node do nothing until it sees the peer.
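For reference, wait_for_all is a corosync 2.x votequorum option (not available in 1.x). A corosync.conf fragment might look like this; two_node is shown as well since it is the usual companion setting for a two-node cluster, and in corosync 2.x setting two_node: 1 enables wait_for_all by default:

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 1
}
```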


To do otherwise would be to risk a split-brain. Each node needs to know 
the state of the peer in order to run services safely. By having both 
start at the same time, each knows what the other is doing. By 
disabling quorum, you allow one node to continue to operate when the 
other leaves, but it needs that initial connection to know for sure what 
it's doing.


Alternatively, by fencing the peer after timing out on start, a node can 
say for sure that the peer is off, and can then start services knowing it 
won't cause a split-brain. Of course, if you auto-start the cluster and 
don't use wait_for_all, you risk a fence loop.


digimer

On 06/10/14 12:45 AM, N, Ravikiran wrote:

[quoted text snipped]





