Hi Strahil, Here is the output of those commands.... I appreciate the help!
# crm config show
node 1: ceha03 \
        attributes ethmonitor-ens192=1
node 2: ceha04 \
        attributes ethmonitor-ens192=1
(...)
primitive stonith_sbd stonith:fence_sbd \
        params devices="/dev/sde1" \
        meta is-managed=true
(...)
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=2.0.2-1.el8-744a30d655 \
        cluster-infrastructure=corosync \
        cluster-name=ps_dom \
        stonith-enabled=true \
        no-quorum-policy=ignore \
        stop-all-resources=false \
        cluster-recheck-interval=60 \
        symmetric-cluster=true \
        stonith-watchdog-timeout=0
rsc_defaults rsc-options: \
        is-managed=false \
        resource-stickiness=0 \
        failure-timeout=1min

# cat /etc/sysconfig/sbd
SBD_DEVICE="/dev/sde1"
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_DELAY_START=no
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
SBD_TIMEOUT_ACTION=flush,reboot
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=

# systemctl status sbd
sbd.service - Shared-storage based fencing daemon
   Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-21 18:36:28 EDT; 15min ago
     Docs: man:sbd(8)
  Process: 12810 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch (code=exited, status=0/SUCCESS)
 Main PID: 12812 (sbd)
    Tasks: 4 (limit: 26213)
   Memory: 14.5M
   CGroup: /system.slice/sbd.service
           ├─12812 sbd: inquisitor
           ├─12814 sbd: watcher: /dev/sde1 - slot: 0 - uuid: 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
           ├─12815 sbd: watcher: Pacemaker
           └─12816 sbd: watcher: Cluster

Sep 21 18:36:27 ceha03.canlab.ibm.com systemd[1]: Starting Shared-storage based fencing daemon...
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12810]: notice: main: Doing flush + writing 'b' to sysrq on timeout
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12815]: pcmk: notice: servant_pcmk: Monitoring Pacemaker health
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: servant_cluster: Monitoring unknown cluster health
Sep 21 18:36:27 ceha03.canlab.ibm.com sbd[12814]: /dev/sde1: notice: servant_md: Monitoring slot 0 on disk /dev/sde1
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: watchdog_init: Using watchdog device '/dev/watchdog'
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12816]: cluster: notice: sbd_get_two_node: Corosync is in 2Node-mode
Sep 21 18:36:28 ceha03.canlab.ibm.com sbd[12812]: notice: inquisitor_child: Servant cluster is healthy (age: 0)
Sep 21 18:36:28 ceha03.canlab.ibm.com systemd[1]: Started Shared-storage based fencing daemon.

# sbd -d /dev/disk/by-id/scsi-<long_uuid> dump
[root@ceha03 by-id]# sbd -d /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 dump
==Dumping header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1
Header version     : 2.1
UUID               : 94d67f15-e301-4fa9-89ae-e3ce2e82c9e7
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1 is dumped

Thanks,
Phil Stedman
Db2 High Availability Development and Support
Email: pmste...@us.ibm.com

From: Strahil Nikolov <hunter86...@yahoo.com>
To: "users@clusterlabs.org" <users@clusterlabs.org>
Date: 09/21/2020 01:41 PM
Subject: [EXTERNAL] Re: [ClusterLabs] SBD fencing not working on my two-node cluster
Sent by: "Users" <users-boun...@clusterlabs.org>

Can you provide (replace sensitive data):

crm configure show
cat /etc/sysconfig/sbd
systemctl status sbd
sbd -d /dev/disk/by-id/scsi-<long_uuid> dump

P.S.: It is very bad practice to use "/dev/sdXYZ" as these are not permanent. Always use persistent names
like those inside "/dev/disk/by-XYZ/ZZZZ". Also, SBD needs at most a 10 MB block device, and yours seems unnecessarily big. Most probably /dev/sde1 is your problem.

Best Regards,
Strahil Nikolov

On Monday, September 21, 2020, 23:19:47 GMT+3, Philippe M Stedman <pmste...@us.ibm.com> wrote:

Hi,

I have been following the instructions on the following page to try and configure SBD fencing on my two-node cluster:
https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html

I am able to get through all the steps successfully. I am using the following device (/dev/sde1) as my shared disk:

Disk /dev/sde: 20 GiB, 21474836480 bytes, 41943040 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 43987868-1C0B-41CE-8AF8-C522AB259655

Device     Start      End  Sectors Size Type
/dev/sde1     48 41942991 41942944  20G IBM General Parallel Fs

Since I don't have a hardware watchdog at my disposal, I am using the software watchdog (softdog) instead. Having said this, I am able to get through all the steps successfully... I create the fence agent resource successfully, and it shows as Started in crm status output:

stonith_sbd     (stonith:fence_sbd):    Started ceha04

The problem is when I run crm node fence ceha04 to test out fencing a host in my cluster.
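To put Strahil's sizing point in numbers: the partition above is roughly two thousand times larger than SBD needs. A quick sketch using only the sector count from the fdisk output (plain shell arithmetic, no cluster tooling required):

```shell
# /dev/sde1 from the fdisk output above: 41942944 sectors of 512 bytes each.
sectors=41942944
bytes=$((sectors * 512))
mib=$((bytes / 1024 / 1024))
echo "partition size: ${mib} MiB"
# SBD only stores a small header plus 255 messaging slots, so a dedicated
# ~10 MiB LUN (addressed via a persistent /dev/disk/by-id path) is plenty.
```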
From the crm status output, I see that the reboot action has failed; furthermore, in the system logs, I see the following messages:

Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Requesting fencing (reboot) of node ceha04
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Client pacemaker-controld.24146.5ff1ac0c wants to fence (reboot) 'ceha04' with device '(any)'
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Requesting peer fencing (reboot) of ceha04
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: notice: Couldn't find anyone to fence (reboot) ceha04 with any device
Sep 21 14:12:33 ceha04 pacemaker-fenced[24142]: error: Operation reboot of ceha04 by <no-one> for pacemaker-controld.24146@ceha04.1bad3987: No such device
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3/1:4317:0:ec560474-96ea-4984-b801-400d11b5b3ae: No such device (-19)
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Stonith operation 3 for ceha04 failed (No such device): aborting transition.
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: warning: No devices found in cluster to fence ceha04, giving up
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Transition 4317 aborted: Stonith failed
Sep 21 14:12:33 ceha04 pacemaker-controld[24146]: notice: Peer ceha04 was not terminated (reboot) by <anyone> on behalf of pacemaker-controld.24146: No such device

I don't know why Pacemaker isn't able to discover my fencing resource. Why isn't it able to find anyone to fence the host from the cluster? Any help is greatly appreciated. I can provide more details as required.
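One pattern that matches the "Couldn't find anyone to fence ... with any device" symptom is the fencer having no host mapping for the device, so no node is considered a valid target. As a hedged sketch only (`pcmk_host_list` is a standard Pacemaker fencing-device parameter, but whether it is the fix for this particular cluster is an assumption), the primitive could name the hosts explicitly and use the persistent device path Strahil recommends:

```
primitive stonith_sbd stonith:fence_sbd \
        params devices="/dev/disk/by-id/scsi-36000c292840d37bd13eb6be46d3af4ab-part1" \
               pcmk_host_list="ceha03 ceha04" \
        meta is-managed=true
```

After a change like this, running `stonith_admin --list-registered` on each node shows which fencing devices pacemaker-fenced has actually registered, which helps confirm whether the "No such device" error has gone away.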
Thanks,
Phil Stedman
Db2 High Availability Development and Support
Email: pmste...@us.ibm.com
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/