Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 25/04/2013, at 5:22 AM, Brian J. Murrell br...@interlinx.bc.ca wrote:

> On 13-04-24 01:16 AM, Andrew Beekhof wrote:
>>
>> Almost certainly you are hitting:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=951340
>
> Yup.  The patch posted there fixed it.
>
>> I am doing my best to convince the people who make decisions that this is
>> worthy of an update before 6.5.
>
> I've added my voice to the bug, if that's any help.

I certainly hope so :)


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread David Coulson


On 4/25/13 7:43 PM, Andrew Beekhof wrote:

> I certainly hope so :)

So I should complain to our sales people about this BZ before we upgrade
our clusters to 6.4?




Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 26/04/2013, at 10:24 AM, David Coulson da...@davidcoulson.net wrote:

> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade our
> clusters to 6.4?

I don't think it would hurt to demonstrate how many people are using it.


Re: [Pacemaker] why so long to stonith?

2013-04-25 Thread Andrew Beekhof

On 26/04/2013, at 10:24 AM, David Coulson da...@davidcoulson.net wrote:

> On 4/25/13 7:43 PM, Andrew Beekhof wrote:
>> I certainly hope so :)
> So I should complain to our sales people about this BZ before we upgrade our
> clusters to 6.4?

Actually, I'm going to back-track on this.
After further investigation it appears that only plugin-based clusters (i.e. those
using corosync.conf) are affected.
You won't have any problems if, as recommended, you use cman.
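
For anyone unsure which kind of cluster they have: "plugin-based" here means
corosync 1.x loading Pacemaker through a service block, roughly like the sketch
below.  This is an illustration only, not a complete config; real files also
carry totem/logging sections, and the exact location of the snippet varies.

  # corosync.conf (or a /etc/corosync/service.d/ snippet) -- plugin mode
  service {
      name: pacemaker
      # ver: 0 launches the Pacemaker daemons from inside corosync;
      # ver: 1 expects pacemakerd to be started separately.
      # Either way, this is the "plugin-based" mode referred to above.
      ver: 0
  }

A cman-based cluster instead defines its membership and fencing in
/etc/cluster/cluster.conf and does not load this plugin, so per the above it
is not affected.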




Re: [Pacemaker] why so long to stonith?

2013-04-24 Thread Brian J. Murrell
On 13-04-24 01:16 AM, Andrew Beekhof wrote:
>
> Almost certainly you are hitting:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=951340

Yup.  The patch posted there fixed it.

> I am doing my best to convince the people who make decisions that this is worthy
> of an update before 6.5.

I've added my voice to the bug, if that's any help.

b.






[Pacemaker] why so long to stonith?

2013-04-23 Thread Brian J. Murrell
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
(-KILL) corosync on a peer node.  Pacemaker seemed to take a long time
to transition to stonithing it after noticing it was AWOL; the membership
loss is logged at 19:05:21, but fencing is not scheduled until 19:08:31,
over three minutes later, and only because a timer popped
(cause=C_TIMER_POPPED):

Apr 23 19:05:20 node2 corosync[1324]:   [TOTEM ] A processor failed, forming 
new configuration.
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: lost: 
node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]:   [TOTEM ] A processor joined or left the 
membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]:   [CPG   ] chosen downlist: sender r(0) 
ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]:   notice: ais_dispatch_message: Membership 
188: quorum lost
Apr 23 19:05:21 node2 cib[1629]:   notice: crm_update_peer_state: 
crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]:  warning: pe_fence_node: Node node1 will 
be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]:  warning: determine_online_status: Node 
node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]:  warning: custom_action: Action 
MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]:  warning: stage6: Scheduling Node node1 
for STONITH
Apr 23 19:08:31 node2 pengine[1633]:   notice: LogActions: Move
MGS_e4a31b#011(Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]:   notice: te_fence_node: Executing reboot 
fencing operation (15) on node1 (timeout=6)
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: handle_request: Client 
crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for node1: 
fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]:  warning: process_pe_message: Calculated 
Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: log_operation: Operation 
'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 
'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: remote_op_done: Operation 
reboot of node1 by node2 for crmd.1634@node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_callback: Stonith 
operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_notify: Peer node1 
was terminated (st_notify_fence) by node2 for node2: OK 
(ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: 
Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 
4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: NEW:  
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: 
node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: 
send_member_notification: Sending membership update 192 to 2 children
Apr 23 

Re: [Pacemaker] why so long to stonith?

2013-04-23 Thread Digimer
As I understand it, this is a known issue with the 1.1.8 release. I 
believe that 1.1.9 is now available from the pacemaker repos and it 
should fix the problem.


digimer

On 04/23/2013 03:34 PM, Brian J. Murrell wrote:

> Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
> (-KILL) corosync on a peer node.  Pacemaker seemed to take a long time
> to transition to stonithing it after noticing it was AWOL:

> [log snipped; it is identical to the log in the original message above]