On Tue, Jun 7, 2022 at 10:27 AM Zoran Bošnjak <zoran.bosn...@via.si> wrote:
>
> Hi, I need some help with the correct fencing configuration in a 5-node cluster.
>
> The specific issue is that there are 3 rooms, where, in addition to the node
> failure scenario, each room can fail too (for example in case of a room power
> failure or a room network failure).
>
> room0: [ node0 ]
> roomA: [ node1, node2 ]
> roomB: [ node3, node4 ]
>
> - ipmi board is present on each node
> - watchdog timer is available
> - shared storage is not available
>
> Please advise what a proper fencing configuration would be in this case.
>
> The intention is to configure ipmi fencing (using the "fence_idrac" agent) plus
> the watchdog timer as a fallback. In other words, I would like to tell
> pacemaker: "If fencing is required, try to fence via ipmi. In case of ipmi
> fence failure, after some timeout assume the watchdog has rebooted the node, so
> it is safe to proceed as if the (self-)fencing had succeeded."
>
> From the documentation it is not clear to me whether this would be:
> a) multi-level fencing, where ipmi would be the first level and sbd the
> second level (where sbd always succeeds)
> b) or a single fencing level with a timeout

With b), falling back to watchdog fencing wouldn't work properly, although I
remember some recent change that might make it fall back without issues.
I would try to go for a), as with a reasonably current pacemaker version
(iirc 2.1.0 and above) you should be able to make the watchdog-fencing device
visible like other fencing devices (just use fence_watchdog as the fence
agent; the functionality is still implemented inside pacemaker, and the
fence_watchdog binary actually just provides the meta-data).
This way you can limit watchdog fencing to the nodes that actually provide a
proper hardware watchdog, and you can add it to a topology.

Depending on your infrastructure, an alternative to watchdog fencing for your
case (where you can't reach the ipmis in a room with a power outage) might be
fabric fencing.

Klaus

>
> I have tried to follow option b) and create a stonith resource for each node
> and set up stonith-watchdog-timeout, like this:
>
> ---
> # for each node... [0..4]
> export name=...
> export ip=...
> export password=...
> sudo pcs stonith create "fence_ipmi_$name" fence_idrac \
>     lanplus=1 ip="$ip" \
>     username="admin" password="$password" \
>     pcmk_host_list="$name" op monitor interval=10m timeout=10s
>
> sudo pcs property set stonith-watchdog-timeout=20
>
> # start dummy resource
> sudo pcs resource create dummy ocf:heartbeat:Dummy op monitor interval=30s
> ---
>
> I am not sure if additional location constraints have to be specified for
> stonith resources. For example, I have noticed that pacemaker will start a
> stonith resource on the same node as the fencing target. Is this OK?
>
> Should there be any location constraints regarding fencing and rooms?
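A minimal sketch of option a) with pcs, offered as an illustration rather than a tested recipe. It assumes Pacemaker 2.1.0 or later (the dc-version shown in the properties, 2.0.3, is older, so an upgrade would come first), that the node names are node0 through node4 as in the room layout, and that the fence_ipmi_$name devices from the config above already exist. The device id "watchdog" is an assumption about what pacemaker-fenced expects for an explicitly configured fence_watchdog device; check it against the Pacemaker documentation on watchdog fencing before relying on it.

```shell
# ASSUMPTION: node names match the room layout labels (node0..node4).
nodes="node0 node1 node2 node3 node4"

# Make watchdog fencing visible as a regular fence device (Pacemaker >= 2.1.0).
# ASSUMPTION: the device id must be "watchdog"; verify in the Pacemaker docs.
sudo pcs stonith create watchdog fence_watchdog pcmk_host_list="$nodes"

# Watchdog fencing is still only used with a non-zero stonith-watchdog-timeout
# (value reused from the original config).
sudo pcs property set stonith-watchdog-timeout=20

for name in $nodes; do
    # Level 1: fence via IPMI (fence_idrac) first.
    sudo pcs stonith level add 1 "$name" "fence_ipmi_$name"
    # Level 2: fall back to watchdog (self-)fencing if level 1 fails.
    sudo pcs stonith level add 2 "$name" watchdog
done
```

Pacemaker tries topology levels in order and only moves to level 2 for a node once level 1 has failed for it, which matches the "ipmi first, watchdog as fallback" behaviour described in the question.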
>
> 'sbd' is running, and the properties are as follows:
>
> ---
> $ sudo pcs property show
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: debian
>  dc-version: 2.0.3-4b1f869f0f
>  have-watchdog: true
>  last-lrm-refresh: 1654583431
>  stonith-enabled: true
>  stonith-watchdog-timeout: 20
> ---
>
> Ipmi fencing (when the ipmi connection is alive) works correctly for each
> node. The watchdog timer also seems to be working correctly. The problem is
> that the dummy resource is not restarted as expected.
>
> In the test scenario, the dummy resource is currently running on node1. I
> have simulated node failure by unplugging the ipmi AND host network
> interfaces from node1. The result was that node1 got rebooted (by the watchdog),
> but the rest of the pacemaker cluster was unable to fence node1 (this is
> expected, since node1's ipmi is not accessible). The problem is that the dummy
> resource remains stopped and node1 remains unclean. I was expecting
> stonith-watchdog-timeout to kick in, so that the dummy resource would be
> restarted on some other node that has quorum.
>
> Obviously there is something wrong with my configuration, since this seems to
> be a reasonably simple scenario for pacemaker. I'd appreciate your help.
>
> regards,
> Zoran
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>