Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 26/04/2013, at 10:24 AM, David Coulson  wrote:

> 
> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade our 
> clusters to 6.4?

Actually, I'm going to back-track on this.
After further investigation, it appears that only plugin-based clusters (i.e. those 
using corosync.conf) are affected.
You won't have any problems if, as recommended, you use cman.
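
For anyone unsure which kind of cluster they have, a rough check (only a sketch, and 
assuming the stock RHEL 6 file locations) is:

    # Plugin-based (affected): pacemaker is declared to corosync itself,
    # in corosync.conf or an /etc/corosync/service.d/ snippet
    grep -r -B1 -A3 'name: pacemaker' /etc/corosync/ 2>/dev/null

    # CMAN-based (recommended on RHEL 6, not affected): membership comes from
    # cman via /etc/cluster/cluster.conf, with corosync and pacemaker started
    # through the cman and pacemaker init scripts
    test -f /etc/cluster/cluster.conf && echo "cman-style configuration found"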


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 26/04/2013, at 10:24 AM, David Coulson  wrote:

> 
> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade our 
> clusters to 6.4?

I don't think it would hurt to demonstrate how many people are using it.


Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread David Coulson


On 4/25/13 7:43 PM, Andrew Beekhof wrote:

> I certainly hope so :)
So I should complain to our sales people about this BZ before we upgrade 
our clusters to 6.4?




Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 25/04/2013, at 5:22 AM, Brian J. Murrell  wrote:

> On 13-04-24 01:16 AM, Andrew Beekhof wrote:
>> 
>> Almost certainly you are hitting:
>> 
>>https://bugzilla.redhat.com/show_bug.cgi?id=951340
> 
> Yup.  The patch posted there fixed it.
> 
>> I am doing my best to convince the people who make decisions that this is 
>> worthy of an update before 6.5.
> 
> I've added my voice to the bug, if that's any help.

I certainly hope so :)




Re: [Pacemaker] why so long to stonith?

2013-04-24 Thread Brian J. Murrell
On 13-04-24 01:16 AM, Andrew Beekhof wrote:
> 
> Almost certainly you are hitting:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=951340

Yup.  The patch posted there fixed it.

> I am doing my best to convince the people who make decisions that this is worthy 
> of an update before 6.5.

I've added my voice to the bug, if that's any help.

b.






Re: [Pacemaker] why so long to stonith?

2013-04-23 Thread Andrew Beekhof

On 24/04/2013, at 5:34 AM, Brian J. Murrell  wrote:

> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node.  Pacemaker seemed to take a long time
> to transition to stonithing it, though, after noticing it was AWOL:

[snip]

> As you can see, 3 minutes and 10 seconds went by before pacemaker
> transitioned from noticing the node unresponsive to stonithing it.
> 
> This smacks of some kind of mis-configured timeout but I'm not aware
> of any timeout that would have this effect.
> 
> Thoughts?
> b.

Almost certainly you are hitting:

https://bugzilla.redhat.com/show_bug.cgi?id=951340

I am doing my best to convince the people who make decisions that this is worthy 
of an update before 6.5.
The mystery at the moment is why some clusters (i.e. all the ones we tested on 
internally) seem unaffected.
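
For anyone trying to work out whether their own cluster is affected, the pattern in 
the logs quoted further down this thread is the thing to look for: the node is marked 
lost, but no fencing gets scheduled until a timer pops minutes later.  A rough check 
against syslog (assuming it ends up in /var/log/messages, as in Brian's logs):

    # is there a long gap between the membership loss and the
    # policy-engine run that finally schedules fencing?
    grep -E 'quorum lost|cause=C_TIMER_POPPED|pe_fence_node' /var/log/messages

If the pe_fence_node warning only appears minutes after "quorum lost", and the state 
transition just before it was triggered by C_TIMER_POPPED rather than by the 
membership change itself, it looks like the same behaviour.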


Re: [Pacemaker] why so long to stonith?

2013-04-23 Thread Digimer
As I understand it, this is a known issue with the 1.1.8 release. I 
believe that 1.1.9 is now available from the pacemaker repos and it 
should fix the problem.
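
(If you want to confirm which builds you are actually running before and after an 
upgrade, a quick check is:

    rpm -q pacemaker corosync

though whether a particular build really contains the fix is a separate question.)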


digimer

On 04/23/2013 03:34 PM, Brian J. Murrell wrote:

> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node.  Pacemaker seemed to take a long time
> to transition to stonithing it, though, after noticing it was AWOL:

[snip]

[Pacemaker] why so long to stonith?

2013-04-23 Thread Brian J. Murrell
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
(-KILL) corosync on a peer node.  Pacemaker seemed to take a long time
to transition to stonithing it, though, after noticing it was AWOL:

Apr 23 19:05:20 node2 corosync[1324]:   [TOTEM ] A processor failed, forming 
new configuration.
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: lost: 
node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 cib[1629]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]:  warning: pe_fence_node: Node node1 will 
be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]:  warning: determine_online_status: Node 
node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]:  warning: custom_action: Action 
MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]:  warning: stage6: Scheduling Node node1 
for STONITH
Apr 23 19:08:31 node2 pengine[1633]:   notice: LogActions: Move
MGS_e4a31b#011(Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]:   notice: te_fence_node: Executing reboot 
fencing operation (15) on node1 (timeout=6)
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: handle_request: Client 
crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for node1: 
fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]:  warning: process_pe_message: Calculated 
Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: log_operation: Operation 
'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 
'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: remote_op_done: Operation 
reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_callback: Stonith 
operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_notify: Peer node1 
was terminated (st_notify_fence) by node2 for node2: OK 
(ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: NEW:  
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 192 to 2 children
Apr 23