Re: [ClusterLabs] How is fencing and unfencing supposed to work?
On 2018-09-04 8:49 p.m., Ken Gaillot wrote:
> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
> > I’m seeing unexpected behavior when using “unfencing” – I don’t
> > think I’m understanding it correctly. I configured a resource that
> > “requires unfencing” and have a custom fencing agent which “provides
> > unfencing”. I perform a simple test where I set up the cluster and
> > then run “pcs stonith fence node2”, and I see that node2 is
> > successfully fenced by sending an “off” action to my fencing agent.
> > But, immediately after this, I see an “on” action sent to my fencing
> > agent. My fence agent doesn’t implement the “reboot” action, so
> > perhaps it’s trying to reboot by running an off action followed by
> > an on action. Prior to adding “provides unfencing” to the fencing
> > agent, I didn’t see the on action. It seems unsafe to say “node2,
> > you can’t run” and then immediately “you can run”.
>
> I'm not as familiar with unfencing as I'd like, but I believe the
> basic idea is:
>
> - the fence agent's off action cuts the machine off from something
>   essential needed to run resources (generally shared storage or
>   network access)
>
> - the fencing works such that a fenced host is not able to request
>   rejoining the cluster without manual intervention by a sysadmin
>
> - when the sysadmin allows the host back into the cluster, and it
>   contacts the other nodes to rejoin, the cluster will call the fence
>   agent's on action, which is expected to re-enable the host's access
>
> How that works in practice, I have only vague knowledge.

This is correct. Consider fabric fencing, where Fibre Channel ports are disconnected: unfence restores the connection. Similarly, for a pure 'off' fence call to switched PDUs, as you mention above, unfence powers the outlets back up.

> > I don’t think I’m understanding this aspect of fencing/stonith. I
> > thought that the fence agent acted as a proxy to a node: when the
> > node was fenced, it was isolated from shared storage by some means
> > (power, fabric, etc.). It seems like it shouldn’t become unfenced
> > until connectivity between the nodes is repaired. Yet, the node is
> > turned “off” (isolated) and then “on” (unisolated) immediately.
> > This (kind of) makes sense for a fencing agent that uses power to
> > isolate, since when it’s turned back on, pacemaker will not start
> > any resources on that node until it sees the other nodes (due to
> > the wait_for_all setting). However, for other types of fencing
> > agents, it doesn’t make sense. Does the “off” action not mean
> > isolate from shared storage? And the “on” action not mean
> > unisolate? What is the correct way to understand fencing/stonith?
>
> I think the key idea is that "on" will be called when the fenced node
> asks to rejoin the cluster. So stopping that from happening until a
> sysadmin has intervened is an important part (if I'm not missing
> something).
>
> Note that if the fenced node still has network connectivity to the
> cluster, and the fenced node is actually operational, it will be
> notified by the cluster that it was fenced, and it will stop its
> pacemaker, thus fulfilling the requirement. But you obviously can't
> rely on that, because fencing may be called precisely because network
> connectivity is lost or the host is not fully operational.
>
> > The behavior I wanted to see was: when pacemaker lost connectivity
> > to a node, it would run the off action for that node. If this
> > succeeded, it could continue running resources. Later, when
> > pacemaker saw the node again, it would run the “on” action on the
> > fence agent (knowing that it was no longer split-brained). Node2
> > would try to do the same thing, but once it was fenced, it would no
> > longer attempt to fence node1. It also wouldn’t attempt to start
> > any resources. I thought that adding “requires unfencing” to the
> > resource would make this happen. Is there a way to get this
> > behavior?
>
> That is basically what happens; the question is how "pacemaker saw
> the node again" becomes possible.
>
> > Thanks!
> >
> > btw, here's the cluster configuration:
> >
> > pcs cluster auth node1 node2
> > pcs cluster setup --name ataCluster node1 node2
> > pcs cluster start --all
> > pcs property set stonith-enabled=true
> > pcs resource defaults migration-threshold=1
> > pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence meta requires=unfencing
> > pcs stonith create myStonith fence_custom op monitor interval=0 meta provides=unfencing
> > pcs property set symmetric-cluster=true

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
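Since the thread turns on what a fence agent's “off” and “on” actions do, here is a minimal sketch of a custom agent in the spirit of the fence_custom above. It assumes the conventional fence-agent interface (options arrive as name=value lines on stdin, with the target node named by “port”, and the exit status reports success or failure); the isolate/restore bodies are placeholders for whatever actually cuts off and restores the node's access (PDU outlet, fabric port, etc.), and a real agent would also have to implement the metadata action.

```python
"""Sketch of a custom fence agent: "off" isolates, "on" unfences."""


def isolate(node):
    # Placeholder: e.g. disable the node's Fibre Channel port or PDU outlet.
    print(f"isolating {node}")
    return True


def restore(node):
    # Placeholder: re-enable the node's access; this is the unfencing path.
    print(f"restoring {node}")
    return True


def main(lines):
    # Parse the name=value option lines the fencer passes on stdin.
    opts = dict(line.strip().split("=", 1) for line in lines if "=" in line)
    action = opts.get("action", "monitor")
    node = opts.get("port", "")  # "port" conventionally names the target node
    if action == "off":          # fence: cut the node off from shared storage
        ok = isolate(node)
    elif action == "on":         # unfence: restore access when the node rejoins
        ok = restore(node)
    elif action == "monitor":    # health check of the fence device itself
        ok = True
    else:                        # e.g. "reboot" is deliberately unimplemented
        ok = False
    return 0 if ok else 1


# Example invocation, mimicking what the fencer does for "pcs stonith fence node2".
# A real agent would instead end with: sys.exit(main(sys.stdin.readlines()))
rc = main(["action=off\n", "port=node2\n"])
```

With “provides unfencing” set on the stonith device, the cluster calls the “on” path when the fenced node rejoins, which is the behavior discussed above.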
Re: [ClusterLabs] How is fencing and unfencing supposed to work?
Update: It turns out fencing does work as I expected; the problem was with how I was testing it.

I was seeing the node turned “off” (isolated) and then “on” (unisolated) immediately, which seemed wrong. This was because the way I was turning the node off in my testing was to kill some of its processes, including the pacemaker and corosync processes. However, the systemd unit file for pacemaker/corosync is configured to restart the service immediately if it dies. So I was seeing the "on" call immediately after the "off" because the pacemaker/corosync service was restarted, and the node I had just killed appeared to come back immediately.

Thanks,
Ryan

On Tue, Sep 4, 2018 at 7:49 PM Ken Gaillot wrote:
> [snip]
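The restart behavior Ryan describes comes from systemd; whether Restart= is set on the corosync/pacemaker units varies by distribution, so it is worth checking before running this kind of kill test. A sketch, with an assumed drop-in path, of how to inspect and temporarily disable the automatic restart:

```ini
# Show the effective unit files and look for a Restart= line:
#   systemctl cat corosync pacemaker
#
# Hypothetical drop-in to suppress automatic restarts while testing,
# e.g. /etc/systemd/system/corosync.service.d/no-restart.conf:
[Service]
Restart=no
# Apply with: systemctl daemon-reload
```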
[ClusterLabs] How is fencing and unfencing supposed to work?
I’m seeing unexpected behavior when using “unfencing” – I don’t think I’m understanding it correctly. I configured a resource that “requires unfencing” and have a custom fencing agent which “provides unfencing”. I perform a simple test where I set up the cluster and then run “pcs stonith fence node2”, and I see that node2 is successfully fenced by sending an “off” action to my fencing agent. But, immediately after this, I see an “on” action sent to my fencing agent. My fence agent doesn’t implement the “reboot” action, so perhaps it’s trying to reboot by running an off action followed by an on action. Prior to adding “provides unfencing” to the fencing agent, I didn’t see the on action. It seems unsafe to say “node2, you can’t run” and then immediately “you can run”.

I don’t think I’m understanding this aspect of fencing/stonith. I thought that the fence agent acted as a proxy to a node: when the node was fenced, it was isolated from shared storage by some means (power, fabric, etc.). It seems like it shouldn’t become unfenced until connectivity between the nodes is repaired. Yet, the node is turned “off” (isolated) and then “on” (unisolated) immediately. This (kind of) makes sense for a fencing agent that uses power to isolate, since when it’s turned back on, pacemaker will not start any resources on that node until it sees the other nodes (due to the wait_for_all setting). However, for other types of fencing agents, it doesn’t make sense. Does the “off” action not mean isolate from shared storage? And the “on” action not mean unisolate? What is the correct way to understand fencing/stonith?

The behavior I wanted to see was: when pacemaker lost connectivity to a node, it would run the off action for that node. If this succeeded, it could continue running resources. Later, when pacemaker saw the node again, it would run the “on” action on the fence agent (knowing that it was no longer split-brained).
Node2 would try to do the same thing, but once it was fenced, it would no longer attempt to fence node1. It also wouldn’t attempt to start any resources. I thought that adding “requires unfencing” to the resource would make this happen. Is there a way to get this behavior?

Thanks!

btw, here's the cluster configuration:

- pcs cluster auth node1 node2
- pcs cluster setup --name ataCluster node1 node2
- pcs cluster start --all
- pcs property set stonith-enabled=true
- pcs resource defaults migration-threshold=1
- pcs resource create Jaws ocf:atavium:myResource op stop on-fail=fence meta requires=unfencing
- pcs stonith create myStonith fence_custom op monitor interval=0 meta provides=unfencing
- pcs property set symmetric-cluster=true
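As the follow-up in this thread notes, killing the cluster daemons is a misleading way to exercise fencing, since systemd may restart them immediately. A couple of alternative ways to trigger it (the iptables rules assume corosync's default UDP transport on port 5405; adjust for your setup):

```shell
# Ask the cluster to fence a node (as in the test above):
#   pcs stonith fence node2
# or call the fencer directly:
#   stonith_admin --fence node2
#
# To simulate a real loss of connectivity instead, drop corosync
# traffic on the node to be fenced, then remove the rules afterwards:
#   iptables -A INPUT  -p udp --dport 5405 -j DROP
#   iptables -A OUTPUT -p udp --dport 5405 -j DROP
```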