Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-18 Thread Andrew Beekhof
On Thu, Jan 14, 2010 at 4:40 AM, Miki Shapiro  wrote:
>>> And the node really did power down?
> Yes. 100% certain and positive. OFF.
>
>>> But the other node didn't notice?!?
> Its resources (drbd master and the fence clone) did notice.
> Its DC-election mechanism did NOT notice (and the survivor didn't re-elect).
> Its quorum mechanism did NOT notice (and the survivor still thinks
> it has quorum).
>
> Logs attached.

Hmmm.
Not much to see there. crmd gets the membership event and then just
sort of stops.
Could you try again with debug turned on in openais.conf please?
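
Something along these lines in the logging stanza - a sketch from memory,
so double-check the key names against the example config shipped with your
openais packages:

    # /etc/ais/openais.conf (path may vary on SLES 11)
    logging {
            to_syslog: yes          # keep routing messages to syslog
            syslog_facility: daemon
            debug: on               # the switch that matters here
            timestamp: on           # timestamps help line up the two nodes
    }

Restart openais on both nodes afterwards, then reproduce the halt.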

>
> Keep in mind I'm relatively new to this. PEBKAC is not entirely outside the
> realm of the possible ;)

Doesn't look like it, but you might want to try something a little
more recent than 1.0.3.

> Thanks!
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Wednesday, 13 January 2010 7:26 PM
> To: pacemaker@oss.clusterlabs.org
> Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster
>
> On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro  
> wrote:
>> Halt = soft off - a natively issued poweroff command that shuts stuff down
>> nicely, then powers the blade off.
>
> And the node really did power down?
> But the other node didn't notice?!? That is insanely bad - looking
> forward to those logs.
>
>> Logs I'll send tomorrow (our timezone is just wrapping up for the day).
>
> Yep, I'm actually an Aussie too... just not living there at the moment :-)
>



Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-13 Thread Miki Shapiro
>> And the node really did power down?
Yes. 100% certain and positive. OFF.

>> But the other node didn't notice?!?
Its resources (drbd master and the fence clone) did notice.
Its DC-election mechanism did NOT notice (and the survivor didn't re-elect).
Its quorum mechanism did NOT notice (and the survivor still thinks it
has quorum).
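
(How I checked, for what it's worth: the <cib> root element in the CIB
carries both bits of state. Attribute names below are from my 1.0.x
install, so treat them as an assumption to verify:)

    # Run on the survivor: the CIB header records the quorum flag and the
    # node it still considers DC
    cibadmin -Q | head -1
    # expect something like:  <cib have-quorum="1" dc-uuid="..." ...>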

Logs attached. 

Keep in mind I'm relatively new to this. PEBKAC is not entirely outside the
realm of the possible ;)

Thanks!

-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net] 
Sent: Wednesday, 13 January 2010 7:26 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster

On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro  wrote:
> Halt = soft off - a natively issued poweroff command that shuts stuff down
> nicely, then powers the blade off.

And the node really did power down?
But the other node didn't notice?!? That is insanely bad - looking
forward to those logs.

> Logs I'll send tomorrow (our timezone is just wrapping up for the day).

Yep, I'm actually an Aussie too... just not living there at the moment :-)



Attachment: pacemaker-problem.tbz2


Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-13 Thread Andrew Beekhof
On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro  wrote:
> Halt = soft off - a natively issued poweroff command that shuts stuff down
> nicely, then powers the blade off.

And the node really did power down?
But the other node didn't notice?!? That is insanely bad - looking
forward to those logs.

> Logs I’ll send tomorrow (our timezone is just wrapping up for the day).

Yep, I'm actually an Aussie too... just not living there at the moment :-)



Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-13 Thread Miki Shapiro
Halt = soft off - a natively issued poweroff command that shuts stuff down 
nicely, then powers the blade off.

Logs I'll send tomorrow (our timezone is just wrapping up for the day).

Thanks!

From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Wednesday, 13 January 2010 7:07 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster


On Wed, Jan 13, 2010 at 3:25 AM, Miki Shapiro
<miki.shap...@coles.com.au> wrote:
Hi all

I'm attempting to build a 2-way cluster, SLES-11-based with an 
openais/pacemaker stack. I've got the nodes and a resource (a drbd volume) 
happening. What I'm not sure about is the active CRM DC election process.

I configured a null stonith resource for each node.
I have stonith-enabled set to true (I will implement a real stonith facility
once the final solution is in place).
I have no-quorum-policy set to ignore (as the cluster is expected to work with 
one node active).

I look at crm_mon or crm_gui, and it's all green and happy.

I now go and halt a node.

define "halt"


Observing crm_mon or crm_gui on node2, I expect to see :

1.   Services appear as down thanks to resource monitoring directives.

2.   The quorum broken (... do I care?)

3.   The new node elected as DC, despite what the book states (at the bottom of
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-status.html
):

"The DC (Designated Controller) node is where all the decisions are made and if 
the current DC fails a new one is elected from the remaining cluster nodes. The 
choice of DC is of no significance to an administrator beyond the fact that its 
logs will generally be more interesting."



It IS of significance. I want the brain, as far as the surviving node is
concerned, to be running on a non-halted server.


What happens in practice is:
If I halt the DC,

1.   Resources DO appear stopped and do-their-thing(tm)

2.   [PROBLEM?] Quorum DOES NOT appear as broken

3.   [PROBLEM?] The remaining node DOES NOT get (visibly) elected as the 
new DC.
If I halt the non-DC node,

1.   Resources DO appear stopped and do-their-thing(tm)

2.   Quorum DOES appear as broken

3.   [PROBLEM?]The remaining node DOES NOT get (visibly) elected as the new 
DC.

Now if my understanding serves me right, the DC is the baton-holding CRM that
does the thinking for the entire cluster. If the surviving node1 thinks that the
(DEAD) node2 is the de-facto brains of the cluster and doesn't take the reins,
I have a dysfunctional cluster.

Can someone please offer some clarification on how one would reasonably expect 
this to work?

Not without logs (one per scenario, as bzip'd attachments please).



Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-13 Thread Andrew Beekhof
On Wed, Jan 13, 2010 at 3:25 AM, Miki Shapiro wrote:

>  Hi all
>
>
>
> I’m attempting to build a 2-way cluster, SLES-11-based with an
> openais/pacemaker stack. I’ve got the nodes and a resource (a drbd volume)
> happening. What I’m not sure about is the active CRM DC election process.
>
>
>
> I configured a null stonith resource for each node.
>
> I have stonith-enabled set to true (I will implement a real stonith
> facility once the final solution is in place).
>
> I have no-quorum-policy set to ignore (as the cluster is expected to work
> with one node active).
>
>
>
> I look at crm_mon or crm_gui, and it’s all green and happy.
>
>
>
> I now go and halt a node.
>

define "halt"


>
>
> Observing crm_mon or crm_gui on node2, I expect to see :
>
> 1.   Services appear as down thanks to resource monitoring directives.
>
> 2.   The quorum broken (… do I care?)
>
> 3.   The new node elected as DC, despite what the book states (at the bottom of
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-status.html
> ):
>
> *“The DC (Designated Controller) node is where all the decisions are made
> and if the current DC fails a new one is elected from the remaining cluster
> nodes. The choice of DC is of no significance to an administrator beyond the
> fact that its logs will generally be more interesting.”*
>
>
>
> It IS of significance. I want the brain, as far as the surviving node is
> concerned, to be running on a non-halted server.
>
>
>
> What happens in practice is:
>
> If I halt the DC,
>
> 1.   Resources DO appear stopped and do-their-thing™
>
> 2.   [PROBLEM?] Quorum DOES NOT appear as broken
>
> 3.   [PROBLEM?] The remaining node DOES NOT get (visibly) elected as
> the new DC.
>
> If I halt the non-DC node,
>
> 1.   Resources DO appear stopped and do-their-thing™
>
> 2.   Quorum DOES appear as broken
>
> 3.   [PROBLEM?]The remaining node DOES NOT get (visibly) elected as
> the new DC.
>
>
>
> Now if my understanding serves me right, the DC is the baton-holding CRM
> that does the thinking for the entire cluster. If the surviving node1 thinks
> that the (DEAD) node2 is the de-facto brains of the cluster and doesn't take
> the reins, I have a dysfunctional cluster.
>
>
>
> Can someone please offer some clarification on how one would reasonably
> expect this to work?
>

Not without logs (one per scenario, as bzip'd attachments please).


[Pacemaker] DC election with downed node in 2-way cluster

2010-01-12 Thread Miki Shapiro
Hi all

I'm attempting to build a 2-way cluster, SLES-11-based with an 
openais/pacemaker stack. I've got the nodes and a resource (a drbd volume) 
happening. What I'm not sure about is the active CRM DC election process.

I configured a null stonith resource for each node.
I have stonith-enabled set to true (I will implement a real stonith facility
once the final solution is in place).
I have no-quorum-policy set to ignore (as the cluster is expected to work with 
one node active).
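
A minimal crm-shell sketch of that configuration ("node1"/"node2" stand in
for my real host names):

    # one null stonith device per node, plus the two cluster properties
    # mentioned above
    primitive st-null-1 stonith:null params hostlist="node1"
    primitive st-null-2 stonith:null params hostlist="node2"
    property stonith-enabled="true" \
             no-quorum-policy="ignore"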

I look at crm_mon or crm_gui, and it's all green and happy.
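
(The quick check here is a one-shot crm_mon; its header names the elected
DC and the partition's quorum state:)

    # one-shot status dump instead of the interactive view
    crm_mon -1 | head
    # look for a header line like:
    #   Current DC: node1 - partition with quorum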

I now go and halt a node.

Observing crm_mon or crm_gui on node2, I expect to see :

1.   Services appear as down thanks to resource monitoring directives.

2.   The quorum broken (... do I care?)

3.   The new node elected as DC, despite what the book states (at the bottom of
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-status.html
):

"The DC (Designated Controller) node is where all the decisions are made and if 
the current DC fails a new one is elected from the remaining cluster nodes. The 
choice of DC is of no significance to an administrator beyond the fact that its 
logs will generally be more interesting."



It IS of significance. I want the brain, as far as the surviving node is
concerned, to be running on a non-halted server.


What happens in practice is:
If I halt the DC,

1.   Resources DO appear stopped and do-their-thing(tm)

2.   [PROBLEM?] Quorum DOES NOT appear as broken

3.   [PROBLEM?] The remaining node DOES NOT get (visibly) elected as the 
new DC.
If I halt the non-DC node,

1.   Resources DO appear stopped and do-their-thing(tm)

2.   Quorum DOES appear as broken

3.   [PROBLEM?]The remaining node DOES NOT get (visibly) elected as the new 
DC.

Now if my understanding serves me right, the DC is the baton-holding CRM that
does the thinking for the entire cluster. If the surviving node1 thinks that the
(DEAD) node2 is the de-facto brains of the cluster and doesn't take the reins,
I have a dysfunctional cluster.
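
(A direct way to test this, assuming the 1.0-era tooling: crmadmin can ask
a live crmd which node it currently believes is the DC:)

    # run on the survivor after halting the other node; prints the name of
    # the node the local crmd believes is the DC
    crmadmin -D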

Can someone please offer some clarification on how one would reasonably expect 
this to work?

Thanks!


Miki Shapiro
Linux Systems Engineer
Infrastructure Services & Operations

745 Springvale Road
Mulgrave 3170 Australia
Email miki.shap...@coles.com.au
Phone: 61 3 854 10520
Fax: 61 3 854 10558


