Hi, I need some help with the correct fencing configuration in a 5-node cluster.

The specific issue is that there are 3 rooms, and in addition to the node 
failure scenario, each room can fail too (for example, in case of a room power 
failure or a room network failure).

room0: [ node0 ]
roomA: [ node1, node2 ]
roomB: [ node3, node4 ]
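
As a quick sanity check of the layout above, here is a minimal sketch of the quorum arithmetic. It assumes the corosync default of one vote per node and a simple majority; the function name `survivors` is only illustrative.

```shell
#!/bin/sh
# Node counts per room, taken from the layout above.
# Assumption: one vote per node, quorum is a simple majority.
total=5
quorum=$(( total / 2 + 1 ))

# Print how many nodes survive a complete failure of each room.
survivors() {
    for entry in room0:1 roomA:2 roomB:2; do
        room=${entry%%:*}    # text before the colon
        lost=${entry##*:}    # text after the colon
        left=$(( total - lost ))
        if [ "$left" -ge "$quorum" ]; then
            state=quorate
        else
            state=inquorate
        fi
        echo "$room fails: $left nodes left, $state"
    done
}

survivors
```

Under these assumptions, the loss of any single room still leaves at least 3 of 5 nodes, so the surviving partition should keep quorum.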

- ipmi board is present on each node
- watchdog timer is available
- shared storage is not available

Please advise what would be a proper fencing configuration in this case.

The intention is to configure ipmi fencing (using the "fence_idrac" agent) plus 
a watchdog timer as a fallback. In other words, I would like to tell 
pacemaker: "If fencing is required, try to fence via ipmi. In case of ipmi 
fence failure, after some timeout assume the watchdog has rebooted the node, so 
it is safe to proceed, as if the (self-)fencing had succeeded."

From the documentation it is not clear to me whether this would be:
a) multi-level fencing, where ipmi would be the first level and sbd the second 
level (where sbd always succeeds)
b) or a single fencing level with a timeout

I have tried to follow option b): I created a stonith resource for each node 
and set stonith-watchdog-timeout, like this:

---
# for each node... [0..4]
export name=...
export ip=...
export password=...
sudo pcs stonith create "fence_ipmi_$name" fence_idrac \
    lanplus=1 ip="$ip" \
    username="admin"  password="$password" \
    pcmk_host_list="$name" op monitor interval=10m timeout=10s

sudo pcs property set stonith-watchdog-timeout=20

# start dummy resource
sudo pcs resource create dummy ocf:heartbeat:Dummy op monitor interval=30s
---

I am not sure if additional location constraints have to be specified for 
stonith resources. For example: I have noticed that pacemaker will start a 
stonith resource on the same node as the fencing target. Is this OK? 

Should there be any location constraints regarding fencing and rooms?
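
To make the question concrete, this is a sketch of the kind of constraint I have in mind: keeping each fence device off the node it targets. It is a dry run that only prints the `pcs` commands. The node names node0..node4 are taken from the room layout and the `fence_ipmi_<name>` ids follow the naming used above. Whether such constraints are actually needed is exactly what I am unsure about.

```shell
#!/bin/sh
# Dry run: print one "avoids" location constraint per fence device,
# so that fence_ipmi_<name> is not started on <name> itself.
# Nothing is applied to the cluster; the commands are only echoed.
print_constraints() {
    for name in node0 node1 node2 node3 node4; do
        echo "pcs constraint location fence_ipmi_$name avoids $name"
    done
}

print_constraints
```

The printed lines can be reviewed first and then run with sudo if they turn out to be appropriate.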

'sbd' is running, properties are as follows:

---
$ sudo pcs property show
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: debian
 dc-version: 2.0.3-4b1f869f0f
 have-watchdog: true
 last-lrm-refresh: 1654583431
 stonith-enabled: true
 stonith-watchdog-timeout: 20
---

Ipmi fencing (when the ipmi connection is alive) works correctly for each node. 
The watchdog timer also seems to be working correctly. The problem is that the 
dummy resource is not restarted as expected.

In the test scenario, the dummy resource is currently running on node1. I have 
simulated node failure by unplugging the ipmi AND host network interfaces from 
node1. The result was that node1 got rebooted (by the watchdog), but the rest 
of the pacemaker cluster was unable to fence node1 (this is expected, since 
node1's ipmi is not accessible). The problem is that the dummy resource remains 
stopped and node1 stays unclean. I was expecting stonith-watchdog-timeout to 
kick in, so that the dummy resource gets restarted on some other node that has 
quorum. 
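
One thing that can be checked independently is whether the two timeouts are consistent with each other. The sketch below assumes the commonly cited rule of thumb that stonith-watchdog-timeout should be at least twice the sbd watchdog timeout. The value 5 for SBD_WATCHDOG_TIMEOUT is an assumption (the real value would be in the sbd configuration, e.g. /etc/default/sbd on Debian), and `timeouts_ok` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed if stonith-watchdog-timeout is at least twice the sbd
# watchdog timeout (rule of thumb; both values in seconds).
timeouts_ok() {
    sbd_timeout=$1
    stonith_timeout=$2
    [ "$stonith_timeout" -ge $(( sbd_timeout * 2 )) ]
}

# 5  = assumed SBD_WATCHDOG_TIMEOUT (not confirmed for this cluster)
# 20 = stonith-watchdog-timeout as set above
if timeouts_ok 5 20; then
    echo "timeouts look consistent"
else
    echo "stonith-watchdog-timeout may be too short"
fi
```

If the assumed values hold, the timeouts themselves would not explain why node1 stays unclean.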

Obviously there is something wrong with my configuration, since this seems to 
be a reasonably simple scenario for pacemaker. I appreciate your help.

regards,
Zoran