Hi! I'm using SLES myself, but I think your configuration is missing many colocation constraints (IMHO every ordering constraint should have a corresponding colocation constraint).
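For example (just a sketch; I'm guessing the clone IDs from your log snippets, so adjust them to your actual resource names): if the GFS2 filesystem is ordered after dlm, it should also be colocated with dlm:

    pcs constraint order start dlm-clone then start sharedfs01-clone
    pcs constraint colocation add sharedfs01-clone with dlm-clone INFINITY

An ordering constraint alone only controls the start/stop sequence; without the colocation, Pacemaker is still free to place the two resources on different nodes.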
From the logs of node1, this looks odd to me:

attrd[11024]: error: Connection to the CPG API failed: Library error (2)

After "systemd[1]: Unit pacemaker.service entered failed state." one would expect the node to be fenced. However this is not fencing IMHO:

Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Power key pressed.
Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Powering Off...

The main question is what makes the cluster think the node is lost:

Jan 04 13:58:27 kvm03-node01 corosync[10995]: [TOTEM ] A processor failed, forming new configuration.
Jan 04 13:58:27 kvm03-node02 corosync[28814]: [TOTEM ] A processor failed, forming new configuration.

The answer seems to be node3:

Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node02_monitor_60000 on kvm03-node02.avigol-gcs.dk
Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node03_monitor_60000 on kvm03-node01.avigol-gcs.dk
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [TOTEM ] A new membership (172.31.0.31:1044) was formed. Members
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.

Before:

Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost
Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost

No idea why, but then:

Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost

Why "shutdown" and not "fencing"?

(A side-note on "pe-input-497.bz2": You may want to limit the number of policy files being kept; here I use 100 as a limit. See the pcs example further below.)

Node2 then seems to have rejoined before being fenced:

Jan 04 13:57:21 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE

Then node3 seems unavailable, so resources are being moved away from it:

Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node02 ( kvm03-node03.avigol-gcs.dk -> kvm03-node02.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node03 ( kvm03-node03.avigol-gcs.dk -> kvm03-node01.avigol-gcs.dk )
Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Stop dlm:2 ( kvm03-node03.avigol-gcs.dk ) due to node availability

Then node1 seems gone:

Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.
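(Picking up the side-note on the policy-engine files: this is roughly how such a limit can be set with pcs. This is just a sketch using the standard Pacemaker series properties; please verify the property names on your version, e.g. with "pcs property list --all".)

    pcs property set pe-input-series-max=100
    pcs property set pe-warn-series-max=100
    pcs property set pe-error-series-max=100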
Then suddenly node1 is here again:

Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Transition aborted: Node failure
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
Jan 04 13:58:33 kvm03-node03 dlm_controld[39252]: 5452 cpg_mcast_joined retry 300 plock
Jan 04 13:58:33 kvm03-node03 stonith-ng[37815]: notice: Node kvm03-node01.avigol-gcs.dk state is now member

And it's lost again:

Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1 to be down
Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched

Then it seems only node1 can fence node1, but communication with node1 is lost:

Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list

No surprise then:

Jan 04 13:59:22 kvm03-node03 VirtualDomain(highk32)[25015]: ERROR: highk32: live migration to kvm03-node02.avigol-gcs.dk failed: 1

At Jan 04 13:59:23 node3 seems down:

Jan 04 13:59:23 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability

This will trigger fencing of node3:

Jan 04 14:00:56 kvm03-node03 VirtualDomain(highk35)[27057]: ERROR: forced stop failed
Jan 04 14:00:56 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability

At

Jan 04 14:00:58 kvm03-node03 Filesystem(sharedfs01)[27209]: INFO: Trying to unmount /usr/local/sharedfs01

it seems there are VMs running that the cluster did not know about:

-- The virtual machine qemu-16-centos7-virt-builder-docker-demo01 with its leader PID 1029 has been
-- shut down.
Jan 04 14:00:58 kvm03-node03 kernel: br40: port 7(vnet8) entered disabled state

So I see multiple issues with this configuration. I suggest starting with a single VM configured and then running tests; if they succeed, add one or more VMs and repeat the testing. If a test fails, find out what went wrong, try to fix it, and repeat the test. Sorry, I don't have a better answer for you.

Regards,
Ulrich

>>> Steffen Vinther Sørensen <svint...@gmail.com> wrote on 04.01.2021 at 16:08 in message <CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=re...@mail.gmail.com>:
> Hi all,
>
> I am trying to stabilize a 3-node CentOS7 cluster for production
> usage, VirtualDomains and GFS2 resources.
> However this following use case ends up with node1 fenced, and some
> Virtualdomains in FAILED state.
>
> ------------------
> pcs standby node2
> # everything is live migrated to the other 2 nodes
>
> pcs stop node2
> pcs start node2
> pcs unstandby node2
> # node2 is becoming part of the cluster again, since resource
> stickiness is >0 no resources are migrated at this point.
>
> # time of logs is 13:58:07
> pcs standby node 3
>
> # node1 gets fenced after a short while
>
> # time of log 14:16:02 and repeats every 15 mins
> node3 log ?
> ------------------
>
> I looked through the logs but I got no clue what is going wrong,
> hoping someone may be able to provide a hint.
>
> Please find attached
>
> output of 'pcs config'
> logs from all 3 nodes
> the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/