On Tue, 2018-04-03 at 21:46 +0200, Klaus Wenninger wrote:
> On 04/03/2018 05:43 PM, Ken Gaillot wrote:
> > On Tue, 2018-04-03 at 07:36 +0200, Klaus Wenninger wrote:
> > > On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> > > > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > > > wrote:
> > > > > On Sun, 1 Apr 2018 09:01:15 +0300 Andrei Borzenkov
> > > > > <arvidj...@gmail.com> wrote:
> > > > > > 31.03.2018 23:29, Jehan-Guillaume de Rorthais wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I experienced a problem in a two-node cluster. It has one
> > > > > > > FA per node, and location constraints so that each FA
> > > > > > > avoids the node it is supposed to fence.
> > > > > >
> > > > > > If you mean a stonith resource -- as far as I know, location
> > > > > > does not affect stonith operations; it only changes where
> > > > > > the monitor action is performed.
> > > > >
> > > > > Sure.
> > > > >
> > > > > > You can create two stonith resources and declare that each
> > > > > > can fence only a single node, but that is not a location
> > > > > > constraint, it is resource configuration. Showing your
> > > > > > configuration would be helpful to avoid guessing.
> > > > >
> > > > > True, I should have done that. A conf is worth a thousand
> > > > > words :)
> > > > >
> > > > > crm conf<<EOC
> > > > >
> > > > > primitive fence_vm_srv1 stonith:fence_virsh \
> > > > >   params pcmk_host_check="static-list" pcmk_host_list="srv1" \
> > > > >     ipaddr="192.168.2.1" login="<user>" \
> > > > >     identity_file="/root/.ssh/id_rsa" \
> > > > >     port="srv1-d8" action="off" \
> > > > >   op monitor interval=10s
> > > > >
> > > > > location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
> > > > >
> > > > > primitive fence_vm_srv2 stonith:fence_virsh \
> > > > >   params pcmk_host_check="static-list" pcmk_host_list="srv2" \
> > > > >     ipaddr="192.168.2.1" login="<user>" \
> > > > >     identity_file="/root/.ssh/id_rsa" \
> > > > >     port="srv2-d8" action="off" \
> > > > >   op monitor interval=10s
> > > > >
> > > > > location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
> > > > >
> > > > > EOC
> > >
> > > -inf constraints like that should effectively prevent stonith
> > > actions from being executed on those nodes.
> >
> > It shouldn't ...
> >
> > Pacemaker respects target-role=Started/Stopped for controlling
> > execution of fence devices, but location (or even whether the
> > device is "running" at all) only affects monitors, not execution.
> >
> > > Though there are a few issues with location constraints and
> > > stonith devices.
> > >
> > > When stonithd brings up the devices from the CIB, it runs the
> > > parts of pengine that fully evaluate these constraints, and it
> > > will disable the stonith device if the resource is unrunnable on
> > > that node.
> >
> > That should be true only for target-role, not everything that
> > affects runnability.
>
> cib_device_update bails out via a removal of the device if
> - role == stopped
> - node not in allowed_nodes-list of stonith-resource
> - weight is negative
>
> Wouldn't that include a -inf rule for a node?
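A quick way to check this behavior on a running cluster is to ask the
local fencer which devices it has actually registered; if
cib_device_update really drops a banned device, it should be missing
from the list on that node. A minimal sketch, assuming the
stonith_admin CLI from Pacemaker 1.1.x and the node names from the
config above:

  # On srv1, where fence_vm_srv1 is banned with -inf: if the device
  # was removed by cib_device_update, it will not be listed here.
  stonith_admin --list-registered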
Well, I'll be ... I thought I understood what was going on there. :-)
You're right.

I've frequently seen it recommended to ban fence devices from their
target node when using one device per target. Perhaps it would be
better to give the device a lower (but still positive) score on its
target node than on the other node(s), so it can still be used when no
other node is available.

> It is of course clear that no pengine decision to start a
> stonith resource is required for it to be used for fencing.
>
> Regards,
> Klaus
>
> > > But this part is not retriggered for location constraints with
> > > attributes or other content that would change dynamically. So
> > > one has to stick with constraints as simple and static as those
> > > in the example above.
> > >
> > > Regarding adding/removing location constraints dynamically, I
> > > remember a bug, fixed around 1.1.18, that led to improper
> > > handling and actual use of stonith devices that were disabled or
> > > banned from certain nodes.
> > >
> > > Regards,
> > > Klaus
> > >
> > > > > > > During some tests, a ms resource raised an error during
> > > > > > > the stop action on both nodes. So both nodes were
> > > > > > > supposed to be fenced.
> > > > > >
> > > > > > In a two-node cluster you can set pcmk_delay_max so that
> > > > > > both nodes do not attempt fencing simultaneously.
> > > > >
> > > > > I'm not sure I understand the documentation correctly with
> > > > > regard to this property. Does pcmk_delay_max delay the
> > > > > request itself or the execution of the request?
> > > > >
> > > > > In other words, is it:
> > > > >
> > > > >     delay -> fence query -> fencing action
> > > > >
> > > > > or
> > > > >
> > > > >     fence query -> delay -> fencing action
> > > > >
> > > > > ?
> > > > >
> > > > > The first definition would solve this issue, but not the
> > > > > second. As I understand it, as soon as the fence query has
> > > > > been sent, the node status is "UNCLEAN (online)".
> > > >
> > > > The latter -- you're correct, the node is already unclean by
> > > > that time. Since the stop did not succeed, the node must be
> > > > fenced to continue safely.
> > >
> > > Well, pcmk_delay_base/max are made for the case where both nodes
> > > in a two-node cluster lose contact and each sees the other as
> > > unclean. If the loser gets fenced, its view of the partner node
> > > becomes irrelevant.
> > >
> > > > > > > The first node did, but no FA was then able to fence the
> > > > > > > second one. So the node stayed DC and was reported as
> > > > > > > "UNCLEAN (online)".
> > > > > > >
> > > > > > > We were able to fix the original resource problem, but
> > > > > > > not to avoid the useless second node fencing.
> > > > > > >
> > > > > > > My questions are:
> > > > > > >
> > > > > > > 1. is it possible to cancel the fencing request?
> > > > > > > 2. is it possible to reset the node status to "online"?
> > > > > >
> > > > > > Not that I'm aware of.
> > > > >
> > > > > Argh!
> > > > >
> > > > > ++
> > > >
> > > > You could fix the problem with the stopped service manually,
> > > > then run "stonith_admin --confirm=<NODENAME>" (or the
> > > > higher-level tool equivalent). That tells the cluster that you
> > > > took care of the issue yourself, so fencing can be considered
> > > > complete.
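For example, after repairing the underlying problem by hand, something
like the following tells the cluster that the pending fencing of the
node can be treated as done (hypothetical node name; the higher-level
equivalents are from memory, so double-check your tool's
documentation):

  # Confirm manually that srv2 has been dealt with, so the pending
  # fencing of srv2 is considered complete:
  stonith_admin --confirm=srv2

  # Roughly equivalent higher-level commands:
  pcs stonith confirm srv2      # pcs
  crm node clearstate srv2      # crmsh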
> > > >
> > > > The catch there is that the cluster will assume you stopped
> > > > the node, and that all services on it are stopped. That could
> > > > potentially cause some headaches if it's not true. I'm guessing
> > > > that if you unmanaged all the resources on it first, then
> > > > confirmed fencing, the cluster would detect everything
> > > > properly, and then you could re-manage.
-- 
Ken Gaillot <kgail...@redhat.com>
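Putting the above together, the cautious sequence sketched in that
last paragraph might look like this with crmsh (hypothetical resource
and node names; an untested sketch, not a verified procedure):

  # 1. Keep the cluster from acting on the affected resources:
  crm resource unmanage my_ms_resource   # repeat per resource as needed

  # 2. Fix the underlying problem by hand, then mark the pending
  #    fencing of the node as complete:
  stonith_admin --confirm=srv2

  # 3. Once status looks sane again, hand control back:
  crm resource manage my_ms_resource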