Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2014-01-08 Thread Andrew Beekhof

On 4 Dec 2013, at 11:47 am, Brian J. Murrell br...@interlinx.bc.ca wrote:

 
 On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: 
 
 We did away with all of the policy engine logic involved with trying to move 
 fencing devices off of the target node before executing the fencing action. 
 Behind the scenes all fencing devices are now essentially clones.  If the 
 target node to be fenced has a fencing device running on it, that device can 
 execute anywhere in the cluster to avoid the suicide situation.
 
 OK.
 
 When you are looking at crm_mon output and see a fencing device is running 
 on a specific node, all that really means is that we are going to attempt to 
 execute fencing actions for that device from that node first. If that node 
 is unavailable,
 
 Would it be better to not even try asking the target node to commit suicide,
 and instead always try another node first?

IIRC the only time we ask a node to fence itself is when it is (or thinks it 
is) the last node standing.

 
 we'll try that same device anywhere in the cluster we can get it to work
 
 OK.
 
 (unless you've specifically built some location constraint that prevents the 
 fencing device from ever running on a specific node)
 
 While I do have constraints on the more service-oriented resources to
 give them preferred nodes, I don't have any constraints on the fencing
 resources.
 
 So given all of the above, and given the log I supplied showing that the
 fencing was just not being attempted anywhere other than the node to be
 fenced (which was down during that log) any clues as to where to look
 for why?
 
 Hope that helps.
 
 It explains the differences, but unfortunately I'm still not sure why it
 wouldn't get run somewhere else, eventually, rather than continually
 being attempted on the node to be killed (which as I mentioned, was shut
 down at the time the log was made).

Yes, this is surprising.
Can you enable the blackbox for stonith-ng, reproduce and generate a crm_report 
for us please?  It will contain all the information we need.
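
Roughly along these lines (untested sketch; the fencing daemon may show up as
stonith-ng or stonithd depending on the build, and the crm_report time window
below is only a placeholder):

  # enable blackbox recording in the fencing daemon
  kill -USR1 $(pidof stonith-ng)
  # ... reproduce the failed fencing attempt ...
  # dump the recorded blackbox data to disk
  kill -TRAP $(pidof stonith-ng)
  # collect logs, blackbox dumps, CIB and PE files for that period
  crm_report -f "2013-12-02 19:00" -t "2013-12-02 21:00"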




Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-04 Thread Lars Marowsky-Bree
On 2013-12-03T19:47:41, Brian J. Murrell br...@interlinx.bc.ca wrote:

 So given all of the above, and given the log I supplied showing that the
 fencing was just not being attempted anywhere other than the node to be
 fenced (which was down during that log) any clues as to where to look
 for why?

As far as I saw in your logs, you got a timeout (when host2 tried to
fence host1). That doesn't seem to be related to this change.

 It explains the differences, but unfortunately I'm still not sure why it
 wouldn't get run somewhere else, eventually, rather than continually
 being attempted on the node to be killed (which as I mentioned, was shut
 down at the time the log was made).

I think there was a fix related to this in post-1.1.10 git. Perhaps you
can try that?
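
(Untested sketch, assuming the ClusterLabs GitHub tree and the usual
autotools build:)

  git clone https://github.com/ClusterLabs/pacemaker.git
  cd pacemaker
  ./autogen.sh && ./configure
  make && sudo make install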

Regards,
Lars

(For the record, this change in semantics and behaviour has caused quite a
few support questions here too. I didn't really like it either, but
apparently I'm just a whiner. ;-)

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde




Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread David Vossel
- Original Message -
 From: Brian J. Murrell br...@interlinx.bc.ca
 To: pacema...@clusterlabs.org
 Sent: Monday, December 2, 2013 2:50:41 PM
 Subject: [Pacemaker] catch-22: can't fence node A because node A has the  
 fencing resource
 
 So, I'm migrating my working pacemaker configuration from 1.1.7 to
 1.1.10 and am finding what appears to be a new behavior in 1.1.10.
 
 If a given node is running a fencing resource and that node goes AWOL,
 it needs to be fenced (of course).  But any other node trying to take
 over the fencing resource to fence it appears to first want the current
 owner of the fencing resource to fence the node.  Of course that can't
 happen since the node that needs to do the fencing is AWOL.
 
 So while I can buy into the general policy that a node needs to be
 fenced in order to take over its resources, fencing resources have to
 be excepted from this or there can be this catch-22.

We did away with all of the policy engine logic involved with trying to move 
fencing devices off of the target node before executing the fencing action. 
Behind the scenes all fencing devices are now essentially clones.  If the 
target node to be fenced has a fencing device running on it, that device can 
execute anywhere in the cluster to avoid the suicide situation.

When you are looking at crm_mon output and see a fencing device is running on a 
specific node, all that really means is that we are going to attempt to execute 
fencing actions for that device from that node first. If that node is 
unavailable, we'll try that same device anywhere in the cluster we can get it 
to work (unless you've specifically built some location constraint that 
prevents the fencing device from ever running on a specific node)
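
For example, something like this (pcs syntax from memory) would keep the
device from ever being used from host1:

  # ban the st-fencing device from host1 entirely
  pcs constraint location st-fencing avoids host1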

Hope that helps.

-- Vossel

 
 I believe that is how things were working in 1.1.7 but now that I'm on
 1.1.10[-1.el6_4.4] this no longer seems to be the case.
 
 Or perhaps there is some additional configuration that 1.1.10 needs to
 effect this behavior.  Here is my configuration:
 
 Cluster Name:
 Corosync Nodes:
  
 Pacemaker Nodes:
  host1 host2
 
 Resources:
  Resource: rsc1 (class=ocf provider=foo type=Target)
   Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d
   Meta Attrs: target-role=Started
   Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
   start interval=0 timeout=300 (rsc1-start-0)
   stop interval=0 timeout=300 (rsc1-stop-0)
  Resource: rsc2 (class=ocf provider=chroma type=Target)
   Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515
   Meta Attrs: target-role=Started
   Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
   start interval=0 timeout=300 (rsc2-start-0)
   stop interval=0 timeout=300 (rsc2-stop-0)
 
 Stonith Devices:
  Resource: st-fencing (class=stonith type=fence_foo)
 Fencing Levels:
 
 Location Constraints:
   Resource: rsc1
 Enabled on: host1 (score:20) (id:rsc1-primary)
 Enabled on: host2 (score:10) (id:rsc1-secondary)
   Resource: rsc2
 Enabled on: host2 (score:20) (id:rsc2-primary)
 Enabled on: host1 (score:10) (id:rsc2-secondary)
 Ordering Constraints:
 Colocation Constraints:
 
 Cluster Properties:
  cluster-infrastructure: classic openais (with plugin)
  dc-version: 1.1.10-1.el6_4.4-368c726
  expected-quorum-votes: 2
  no-quorum-policy: ignore
  stonith-enabled: true
  symmetric-cluster: true
 
 One thing that PCS is not showing that might be relevant here is that I
 have a resource-stickiness value set to 1000 to prevent resources from
 failing back to nodes after a failover.
 
 With the above configuration if host1 is shut down, host2 just spins in
 a loop doing:
 
 Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
 be fenced because the node is no longer part of the cluster
 Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node
 host1 is unclean
 Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
 st-fencing_stop_0 on host1 is unrunnable (offline)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
 rsc1_stop_0 on host1 is unrunnable (offline)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
 for STONITH
 Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
 st-fencing#011(Started host1 -> host2)
 Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
 rsc1#011(Started host1 -> host2)
 Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
 fencing operation (13) on host1 (timeout=6)
 Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client
 crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
 Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
 Initiating remote operation reboot for host1:
 ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
 Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated
 Transition 22: 

Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread Brian J. Murrell

On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: 
 
 We did away with all of the policy engine logic involved with trying to move 
 fencing devices off of the target node before executing the fencing action. 
 Behind the scenes all fencing devices are now essentially clones.  If the 
 target node to be fenced has a fencing device running on it, that device can 
 execute anywhere in the cluster to avoid the suicide situation.

OK.

 When you are looking at crm_mon output and see a fencing device is running on 
 a specific node, all that really means is that we are going to attempt to 
 execute fencing actions for that device from that node first. If that node is 
 unavailable,

Would it be better to not even try asking the target node to commit suicide,
and instead always try another node first?

 we'll try that same device anywhere in the cluster we can get it to work

OK.

 (unless you've specifically built some location constraint that prevents the 
 fencing device from ever running on a specific node)

While I do have constraints on the more service-oriented resources to
give them preferred nodes, I don't have any constraints on the fencing
resources.

So given all of the above, and given the log I supplied showing that the
fencing was just not being attempted anywhere other than the node to be
fenced (which was down during that log) any clues as to where to look
for why?

 Hope that helps.

It explains the differences, but unfortunately I'm still not sure why it
wouldn't get run somewhere else, eventually, rather than continually
being attempted on the node to be killed (which as I mentioned, was shut
down at the time the log was made).

Cheers,
b.









Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread Nikita Staroverov



It explains the differences, but unfortunately I'm still not sure why it
wouldn't get run somewhere else, eventually, rather than continually
being attempted on the node to be killed (which as I mentioned, was shut
down at the time the log was made).

Cheers,
b.


Maybe the fence devices were started on other nodes but failed there.
AFAIK, a failed fence device won't be started on a node again until its
failure-timeout has expired, so Pacemaker keeps trying to fence from the
last known good node, regardless of that node's state.
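
(If that is the case, a failure-timeout on the fencing resource would let a
failed instance be retried sooner; pcs syntax from memory, double-check it:)

  # let a failed st-fencing instance become eligible to run again after 60s
  pcs resource meta st-fencing failure-timeout=60s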

We need more logs :)



Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread Andrey Groshev


04.12.2013, 03:30, David Vossel dvos...@redhat.com:
 - Original Message -

  From: Brian J. Murrell br...@interlinx.bc.ca
  To: pacema...@clusterlabs.org
  Sent: Monday, December 2, 2013 2:50:41 PM
  Subject: [Pacemaker] catch-22: can't fence node A because node A has the 
 fencing resource

  So, I'm migrating my working pacemaker configuration from 1.1.7 to
  1.1.10 and am finding what appears to be a new behavior in 1.1.10.

  If a given node is running a fencing resource and that node goes AWOL,
  it needs to be fenced (of course).  But any other node trying to take
  over the fencing resource to fence it appears to first want the current
  owner of the fencing resource to fence the node.  Of course that can't
  happen since the node that needs to do the fencing is AWOL.

  So while I can buy into the general policy that a node needs to be
  fenced in order to take over its resources, fencing resources have to
  be excepted from this or there can be this catch-22.

 We did away with all of the policy engine logic involved with trying to move 
 fencing devices off of the target node before executing the fencing action. 
 Behind the scenes all fencing devices are now essentially clones.  If the 
 target node to be fenced has a fencing device running on it, that device can 
 execute anywhere in the cluster to avoid the suicide situation.

 When you are looking at crm_mon output and see a fencing device is running on 
 a specific node, all that really means is that we are going to attempt to 
 execute fencing actions for that device from that node first. 

Means... means... means...
There are baseline principles of programming, one of which is that obvious is
better than non-obvious.




[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-02 Thread Brian J. Murrell
So, I'm migrating my working pacemaker configuration from 1.1.7 to
1.1.10 and am finding what appears to be a new behavior in 1.1.10.

If a given node is running a fencing resource and that node goes AWOL,
it needs to be fenced (of course).  But any other node trying to take
over the fencing resource to fence it appears to first want the current
owner of the fencing resource to fence the node.  Of course that can't
happen since the node that needs to do the fencing is AWOL.

So while I can buy into the general policy that a node needs to be
fenced in order to take over its resources, fencing resources have to
be excepted from this or there can be this catch-22.

I believe that is how things were working in 1.1.7 but now that I'm on
1.1.10[-1.el6_4.4] this no longer seems to be the case.

Or perhaps there is some additional configuration that 1.1.10 needs to
effect this behavior.  Here is my configuration:

Cluster Name: 
Corosync Nodes:
 
Pacemaker Nodes:
 host1 host2 

Resources: 
 Resource: rsc1 (class=ocf provider=foo type=Target)
  Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
  start interval=0 timeout=300 (rsc1-start-0)
  stop interval=0 timeout=300 (rsc1-stop-0)
 Resource: rsc2 (class=ocf provider=chroma type=Target)
  Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
  start interval=0 timeout=300 (rsc2-start-0)
  stop interval=0 timeout=300 (rsc2-stop-0)

Stonith Devices: 
 Resource: st-fencing (class=stonith type=fence_foo)
Fencing Levels: 

Location Constraints:
  Resource: rsc1
Enabled on: host1 (score:20) (id:rsc1-primary)
Enabled on: host2 (score:10) (id:rsc1-secondary)
  Resource: rsc2
Enabled on: host2 (score:20) (id:rsc2-primary)
Enabled on: host1 (score:10) (id:rsc2-secondary)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: classic openais (with plugin)
 dc-version: 1.1.10-1.el6_4.4-368c726
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true

One thing that PCS is not showing that might be relevant here is that I
have a resource-stickiness value set to 1000 to prevent resources from
failing back to nodes after a failover.
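
(For reference, a cluster-wide default like that is typically set with
something along these lines; exact command from memory:)

  # default stickiness so resources stay put after a failover instead of
  # failing back to their preferred node
  pcs resource defaults resource-stickiness=1000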

With the above configuration if host1 is shut down, host2 just spins in
a loop doing:

Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster
Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node 
host1 is unclean
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action 
st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action 
rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 
for STONITH
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
st-fencing#011(Started host1 -> host2)
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
rsc1#011(Started host1 -> host2)
Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot 
fencing operation (13) on host1 (timeout=6)
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client 
crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: 
Initiating remote operation reboot for host1: 
ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated 
Transition 22: /var/lib/pacemaker/pengine/pe-warn-2.bz2  
Dec  2 20:01:14 host2 stonith-ng[8920]:error: remote_op_done: Operation 
reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 4/13:22:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith 
operation 4 for host1 failed (Timer expired): aborting transition.
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 
was not terminated (reboot) by host2 for host2: Timer expired 
(ref=ad69ead5-0bbb-45d8-bb07-30bcd405ace2) by client crmd.8924
Dec  2 20:01:14 host2 crmd[8924]:   notice: run_graph: Transition 22 
(Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:01:14 host2 pengine[8923]:   notice: unpack_config: On loss of CCM 
Quorum: Ignore
Dec  2 20:01:14 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will 
be fenced because the node is no longer part of the cluster  
Dec  2 20:01:14 host2 pengine[8923]:  warning: