On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > >
> > > > On 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> > > > > Hi all,
> > > > >
> > > > > I experienced a problem in a two-node cluster. It has one
> > > > > fencing agent (FA) per node, and location constraints to keep
> > > > > each agent off the node it is supposed to fence.
> > > >
> > > > If you mean a stonith resource - as far as I know, a location
> > > > constraint does not affect stonith operations; it only changes
> > > > where the monitor action is performed.
> > >
> > > Sure.
> > >
> > > > You can create two stonith resources and declare that each can
> > > > fence only a single node, but that is not a location
> > > > constraint, it is resource configuration. Showing your
> > > > configuration would be helpful to avoid guessing.
> > >
> > > True, I should have done that. A conf is worth a thousand words :)
> > >
> > > crm conf<<EOC
> > >
> > > primitive fence_vm_srv1 stonith:fence_virsh \
> > >   params pcmk_host_check="static-list" pcmk_host_list="srv1" \
> > >     ipaddr="192.168.2.1" login="<user>" \
> > >     identity_file="/root/.ssh/id_rsa" \
> > >     port="srv1-d8" action="off" \
> > >   op monitor interval=10s
> > >
> > > location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > >
> > > primitive fence_vm_srv2 stonith:fence_virsh \
> > >   params pcmk_host_check="static-list" pcmk_host_list="srv2" \
> > >     ipaddr="192.168.2.1" login="<user>" \
> > >     identity_file="/root/.ssh/id_rsa" \
> > >     port="srv2-d8" action="off" \
> > >   op monitor interval=10s
> > >
> > > location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > >
> > > EOC
>
> -inf constraints like that should effectively prevent stonith
> actions from being executed on those nodes.
It shouldn't ... Pacemaker respects target-role=Started/Stopped for
controlling execution of fence devices, but location (or even whether
the device is "running" at all) only affects monitors, not execution.

> Though there are a few issues with location constraints
> and stonith-devices.
>
> When stonithd brings up the devices from the cib, it runs the parts
> of pengine that fully evaluate these constraints, and it would
> disable the stonith-device if the resource is unrunnable on that
> node.

That should be true only for target-role, not everything that affects
runnability.

> But this part is not retriggered for location constraints with
> attributes or other content that would dynamically change. So one
> has to stick with constraints as simple and static as those in the
> example above.
>
> Regarding adding/removing location constraints dynamically, I
> remember a bug that should have been fixed around 1.1.18 that led to
> improper handling: stonith-devices that were disabled or banned from
> certain nodes were actually used.
>
> Regards,
> Klaus
>
> > > > > During some tests, an ms resource raised an error during the
> > > > > stop action on both nodes. So both nodes were supposed to be
> > > > > fenced.
> > > >
> > > > In a two-node cluster you can set pcmk_delay_max so that both
> > > > nodes do not attempt fencing simultaneously.
> > >
> > > I'm not sure I understand the doc correctly with regard to this
> > > property. Does pcmk_delay_max delay the request itself or the
> > > execution of the request?
> > >
> > > In other words, is it:
> > >
> > >   delay -> fence query -> fence action
> > >
> > > or
> > >
> > >   fence query -> delay -> fence action
> > >
> > > ?
> > >
> > > The first definition would solve this issue, but not the second.
> > > As I understand it, as soon as the fence query has been sent, the
> > > node status is "UNCLEAN (online)".
> >
> > The latter -- you're correct, the node is already unclean by that
> > time. Since the stop did not succeed, the node must be fenced to
> > continue safely.
>
> Well, pcmk_delay_base/max are made for the case where both nodes in
> a 2-node cluster lose contact and each see the other as unclean. If
> the loser gets fenced, its view of the partner node becomes
> irrelevant.
>
> > > > > The first node did get fenced, but no FA was then able to
> > > > > fence the second one. So the node stayed DC and was reported
> > > > > as "UNCLEAN (online)".
> > > > >
> > > > > We were able to fix the original resource problem, but not to
> > > > > avoid the useless second node fencing.
> > > > >
> > > > > My questions are:
> > > > >
> > > > > 1. is it possible to cancel the fencing request?
> > > > > 2. is it possible to reset the node status to "online"?
> > > >
> > > > Not that I'm aware of.
> > >
> > > Argh!
> > >
> > > ++
> >
> > You could fix the problem with the stopped service manually, then
> > run "stonith_admin --confirm=<NODENAME>" (or the higher-level tool
> > equivalent). That tells the cluster that you took care of the issue
> > yourself, so fencing can be considered complete.
> >
> > The catch there is that the cluster will assume you stopped the
> > node, and that all services on it are stopped. That could
> > potentially cause some headaches if it's not true. I'm guessing
> > that if you unmanaged all the resources on it first, then confirmed
> > fencing, the cluster would detect everything properly, and then you
> > could re-manage.
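To make that last part concrete, here is a rough sketch of the recovery
sequence with crmsh plus stonith_admin (the resource and node names are
placeholders, and the unmanage-before-confirm step is, as I said above,
a guess rather than something I've tested):

  # 1. Fix whatever kept the resource from stopping, outside the cluster.
  # 2. Keep the cluster's hands off the affected resources.
  crm resource unmanage <ms-resource>
  # 3. Tell the cluster the fencing it requested has been handled manually.
  stonith_admin --confirm=<NODENAME>
  # 4. Once the cluster has re-detected the real state, hand control back.
  crm resource manage <ms-resource>

Keep in mind that --confirm tells the cluster the node can be treated as
cleanly down, so only use it once you are sure of the node's real state.

And regarding the fence race Klaus mentions: guarding against it will
not help with the failed-stop scenario discussed here, but if you want
it anyway, adding a delay to one of the two devices from the
configuration above would look roughly like this (untested sketch, and
10s is an arbitrary value):

  primitive fence_vm_srv1 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv1" \
      ipaddr="192.168.2.1" login="<user>" \
      identity_file="/root/.ssh/id_rsa" \
      port="srv1-d8" action="off" \
      pcmk_delay_max="10s" \
    op monitor interval=10s

leaving fence_vm_srv2 without a delay. If I have the semantics right,
that makes srv1 the likely survivor of such a race, since the action
against it is the one being held back.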
-- 
Ken Gaillot <kgail...@redhat.com>