Hi, and thank you for the insights.

On Thu, Jan 14, 2021 at 8:26 AM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> Hi!
>
> I'm using SLES, but I think your configuration misses many colocations (IMHO
> every ordering should have a corresponding colocation).
Ok, sloppy to forget that, I fixed it so it now looks like this:

[root@kvm03-node01 ~]# pcs constraint list
Location Constraints:
Ordering Constraints:
  start dlm-clone then start clvmd-clone (kind:Mandatory)
  start clvmd-clone then start sharedfs01-clone (kind:Mandatory)
  start sharedfs01-clone then start libvirtd-clone (kind:Mandatory)
  start clvmd-clone then start vmbootimages01-clone (kind:Mandatory)
  start vmbootimages01-clone then start libvirtd-clone (kind:Mandatory)
  start libvirtd-clone then start dns-master (kind:Mandatory)
  start libvirtd-clone then start dns-slave01 (kind:Mandatory)
  start libvirtd-clone then start httpdfrontend01 (kind:Mandatory)
  start libvirtd-clone then start httpdfrontend02 (kind:Mandatory)
  start libvirtd-clone then start highk29 (kind:Mandatory)
  start libvirtd-clone then start highk30 (kind:Mandatory)
  start libvirtd-clone then start highk31 (kind:Mandatory)
  start libvirtd-clone then start highk33 (kind:Mandatory)
  start libvirtd-clone then start highk34 (kind:Mandatory)
  start libvirtd-clone then start highk35 (kind:Mandatory)
  start libvirtd-clone then start stunnel01 (kind:Mandatory)
  start libvirtd-clone then start stunnel02 (kind:Mandatory)
  start libvirtd-clone then start crllists01 (kind:Mandatory)
  start libvirtd-clone then start crllists02 (kind:Mandatory)
  start libvirtd-clone then start antivirus01 (kind:Mandatory)
  start libvirtd-clone then start antivirus02 (kind:Mandatory)
  start libvirtd-clone then start antivirus03 (kind:Mandatory)
  start libvirtd-clone then start postfixrelay01 (kind:Mandatory)
  start libvirtd-clone then start fedora-fcos-demo01 (kind:Mandatory)
  start libvirtd-clone then start centos7-cloud-init-demo01 (kind:Mandatory)
  start libvirtd-clone then start centos7-virt-builder-docker-demo01 (kind:Mandatory)
  start libvirtd-clone then start highk32 (kind:Mandatory)
Colocation Constraints:
  clvmd-clone with dlm-clone (score:INFINITY)
  sharedfs01-clone with clvmd-clone (score:INFINITY)
  vmbootimages01-clone with clvmd-clone (score:INFINITY)
  libvirtd-clone with sharedfs01-clone (score:INFINITY)
  libvirtd-clone with vmbootimages01-clone (score:INFINITY)
  highk32 with libvirtd-clone (score:INFINITY)
  dns-master with libvirtd-clone (score:INFINITY)
  dns-slave01 with libvirtd-clone (score:INFINITY)
  httpdfrontend01 with libvirtd-clone (score:INFINITY)
  httpdfrontend02 with libvirtd-clone (score:INFINITY)
  highk29 with libvirtd-clone (score:INFINITY)
  highk30 with libvirtd-clone (score:INFINITY)
  highk31 with libvirtd-clone (score:INFINITY)
  highk33 with libvirtd-clone (score:INFINITY)
  highk34 with libvirtd-clone (score:INFINITY)
  highk35 with libvirtd-clone (score:INFINITY)
  stunnel01 with libvirtd-clone (score:INFINITY)
  stunnel02 with libvirtd-clone (score:INFINITY)
  crllists01 with libvirtd-clone (score:INFINITY)
  crllists02 with libvirtd-clone (score:INFINITY)
  antivirus01 with libvirtd-clone (score:INFINITY)
  antivirus02 with libvirtd-clone (score:INFINITY)
  antivirus03 with libvirtd-clone (score:INFINITY)
  postfixrelay01 with libvirtd-clone (score:INFINITY)
  fedora-fcos-demo01 with libvirtd-clone (score:INFINITY)
  centos7-cloud-init-demo01 with libvirtd-clone (score:INFINITY)
  centos7-virt-builder-docker-demo01 with libvirtd-clone (score:INFINITY)
Ticket Constraints:

>
> From the logs of node1, this looks odd to me:
> attrd[11024]: error: Connection to the CPG API failed: Library error (2)
>
> After
> systemd[1]: Unit pacemaker.service entered failed state.
> it's expected that the node be fenced.
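Side note on the colocation list above: each of those is just a pcs one-liner,
roughly like this (a sketch, only the first few shown; the VM resources follow
the same pattern against libvirtd-clone):

pcs constraint colocation add clvmd-clone with dlm-clone INFINITY
pcs constraint colocation add sharedfs01-clone with clvmd-clone INFINITY
pcs constraint colocation add libvirtd-clone with sharedfs01-clone INFINITY
pcs constraint colocation add highk29 with libvirtd-clone INFINITY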
>
> However this is not fencing IMHO:
> Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Power key pressed.
> Jan 04 13:59:04 kvm03-node01 systemd-logind[5456]: Powering Off...

Fencing is set up like this:

pcs stonith create ipmi-fencing-node01 fence_ipmilan pcmk_host_check="static-list" pcmk_host_list="kvm03-node01.avigol-gcs.dk" ipaddr=kvm03-node01-console.avigol-gcs.dk login=xxx passwd=xxx op monitor interval=60s
pcs stonith create ipmi-fencing-node02 fence_ipmilan pcmk_host_check="static-list" pcmk_host_list="kvm03-node02.avigol-gcs.dk" ipaddr=kvm03-node02-console.avigol-gcs.dk login=xxx passwd=xxx op monitor interval=60s
pcs stonith create ipmi-fencing-node03 fence_ipmilan pcmk_host_check="static-list" pcmk_host_list="kvm03-node03.avigol-gcs.dk" ipaddr=kvm03-node03-console.avigol-gcs.dk login=xxx passwd=xxx op monitor interval=60s

The method parameter can be set to cycle instead of the default onoff, but that
seems not to be recommended. Sometimes a node stays powered off when fenced, but
most of the time it reboots; I wonder why the behaviour is not always the same.

>
> The main question is what makes the cluster think the node is lost:
> Jan 04 13:58:27 kvm03-node01 corosync[10995]: [TOTEM ] A processor failed, forming new configuration.
> Jan 04 13:58:27 kvm03-node02 corosync[28814]: [TOTEM ] A processor failed, forming new configuration.
>
> The answer seems to be node3:
> Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node02_monitor_60000 on kvm03-node02.avigol-gcs.dk
> Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: Initiating monitor operation ipmi-fencing-node03_monitor_60000 on kvm03-node01.avigol-gcs.dk
> Jan 04 13:58:25 kvm03-node03 corosync[37794]: [TOTEM ] A new membership (172.31.0.31:1044) was formed. Members
> Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
> Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
> Jan 04 13:58:25 kvm03-node03 corosync[37794]: [CPG ] downlist left_list: 0 received
> Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.
>
> Before:
> Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost
> Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost
>
> No idea why, but then:
> Jan 04 13:54:18 kvm03-node03 crmd[37819]: notice: Node kvm03-node02.avigol-gcs.dk state is now lost
> Why "shutdown" and not "fencing"?
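On the method parameter mentioned above: if I end up experimenting with it,
flipping it on an existing stonith device and then doing a manual fence test
from another node should be something like this (a sketch, untested here):

pcs stonith update ipmi-fencing-node01 method=cycle
pcs stonith fence kvm03-node01.avigol-gcs.dk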
>
> (A side-note on "pe-input-497.bz2": You may want to limit the number of policy
> files being kept; here I use 100 as limit)

Got it, now limited to 100 by setting these cluster properties:

pe-error-series-max: 100
pe-input-series-max: 100
pe-warn-series-max: 100

> Node2 then seems to have rejoined before being fenced:
> Jan 04 13:57:21 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>
> Node3 seems unavailable, moving resources to node2:
> Jan 04 13:58:07 kvm03-node03 crmd[37819]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node02 ( kvm03-node03.avigol-gcs.dk -> kvm03-node02.avigol-gcs.dk )
> Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Move ipmi-fencing-node03 ( kvm03-node03.avigol-gcs.dk -> kvm03-node01.avigol-gcs.dk )
> Jan 04 13:58:07 kvm03-node03 pengine[37818]: notice: * Stop dlm:2 ( kvm03-node03.avigol-gcs.dk ) due to node availability
>
> Then node1 seems gone:
> Jan 04 13:58:27 kvm03-node03 corosync[37794]: [TOTEM ] A processor failed, forming new configuration.
>
> Then suddenly node1 is here again:
> Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched
> Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Transition aborted: Node failure
> Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
> Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
> Jan 04 13:58:33 kvm03-node03 dlm_controld[39252]: 5452 cpg_mcast_joined retry 300 plock
> Jan 04 13:58:33 kvm03-node03 stonith-ng[37815]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
>
> And it's lost again:
> Jan 04 13:58:33 kvm03-node03 attrd[37817]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
> Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
>
> Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1 to be down
> Jan 04 13:58:33 kvm03-node03 crmd[37819]: notice: Stonith/shutdown of kvm03-node01.avigol-gcs.dk not matched
>
> Then it seems only node1 can fence node1, but communication with node1 is lost:
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node02 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
> Jan 04 13:59:03 kvm03-node03 stonith-ng[37815]: notice: ipmi-fencing-node01 can fence (reboot) kvm03-node01.avigol-gcs.dk: static-list
>
> No surprise then:
> Jan 04 13:59:22 kvm03-node03 VirtualDomain(highk32)[25015]: ERROR: highk32: live migration to kvm03-node02.avigol-gcs.dk failed: 1
>
> At Jan 04 13:59:23 node3 seems down.
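For completeness, the three policy-file limits mentioned further up were set
with plain cluster property commands, roughly like this (a sketch):

pcs property set pe-error-series-max=100
pcs property set pe-input-series-max=100
pcs property set pe-warn-series-max=100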
> Jan 04 13:59:23 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
>
> This will trigger fencing of node3:
> Jan 04 14:00:56 kvm03-node03 VirtualDomain(highk35)[27057]: ERROR: forced stop failed
>
> Jan 04 14:00:56 kvm03-node03 pengine[37818]: notice: * Stop dlm:1 ( kvm03-node03.avigol-gcs.dk ) due to node availability
>
> At
> Jan 04 14:00:58 kvm03-node03 Filesystem(sharedfs01)[27209]: INFO: Trying to unmount /usr/local/sharedfs01
>
> it seems there are VMs running that the cluster did not know about.

Yes, that seems strange, as the VMs are started by the cluster only.

I just did a test after the latest adjustments (colocations etc.): putting
node02 in standby ends up with node02 being fenced before the migrations
complete. Unfortunately the logs from node02 were lost. Here are a few log
statements from node03.

A standby for node02 was issued and the VMs started to migrate; when the turn
came to the VM highk29:

Jan 16 18:02:15 kvm03-node03 crmd[32778]: notice: Initiating migrate_to operation highk29_migrate_to_0 on kvm03-node02.logiva-gcs.dk
Jan 16 18:02:16 kvm03-node03 systemd-machined[8806]: New machine qemu-3-highk29.

Suddenly node02 seems to leave the cluster before the migrations are completed.
Why would it do that, only 4 seconds after starting a migration?

Jan 16 18:02:20 kvm03-node03 corosync[32756]: [TOTEM ] A processor failed, forming new configuration.
Jan 16 18:02:22 kvm03-node03 corosync[32756]: [TOTEM ] A new membership (172.31.0.31:1412) was formed. Members left: 2
Jan 16 18:02:22 kvm03-node03 corosync[32756]: [TOTEM ] Failed to receive the leave message. failed: 2
Jan 16 18:02:22 kvm03-node03 corosync[32756]: [CPG ] downlist left_list: 1 received
Jan 16 18:02:22 kvm03-node03 corosync[32756]: [CPG ] downlist left_list: 1 received

It seems reasonable now that node02 should be fenced, which is what happens.
But from the stonith-ng logs it reads as if only node02 can fence node02??

Jan 16 18:02:22 kvm03-node03 crmd[32778]: notice: Stonith/shutdown of kvm03-node02.logiva-gcs.dk not matched
Jan 16 18:02:22 kvm03-node03 crmd[32778]: notice: Transition 877 (Complete=13, Pending=0, Fired=0, Skipped=23, Incomplete=37, Source=/var/lib/pacemaker/pengine/pe-input-629.bz2): Stopped
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node02 can fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node01 can not fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node02 can fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node03 can not fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:22 kvm03-node03 stonith-ng[32774]: notice: ipmi-fencing-node01 can not fence (reboot) kvm03-node02.logiva-gcs.dk: static-list
Jan 16 18:02:23 kvm03-node03 pengine[32777]: warning: Cluster node kvm03-node02.logiva-gcs.dk will be fenced: peer is no longer part of the cluster
Jan 16 18:02:23 kvm03-node03 pengine[32777]: warning: Node kvm03-node02.logiva-gcs.dk is unclean

>
> -- The virtual machine qemu-16-centos7-virt-builder-docker-demo01 with its leader PID 1029 has been
> -- shut down.
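For reference, the standby test described above was nothing more exotic than
putting node02 in standby and watching the resource state, roughly (a sketch):

pcs cluster standby kvm03-node02.logiva-gcs.dk
crm_mon -1    # one-shot view of resource state; repeat to follow the migrations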
> Jan 04 14:00:58 kvm03-node03 kernel: br40: port 7(vnet8) entered disabled state
>
> So I see multiple issues with this configuration.
>
> I suggest to start with one VM configured and then make tests; if successful
> add one or more VMs and repeat testing.
> If test was not successful find out what went wrong and try to fix it. Repeat
> test.

Ok, will do.

> Sorry, I don't have a better answer for you.
>
> Regards,
> Ulrich
>
> >>> Steffen Vinther Sørensen <svint...@gmail.com> wrote on 04.01.2021 at 16:08 in message
> <CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=re...@mail.gmail.com>:
> > Hi all,
> > I am trying to stabilize a 3-node CentOS7 cluster for production usage,
> > VirtualDomains and GFS2 resources. However the following use case ends up
> > with node1 fenced, and some VirtualDomains in FAILED state.
> >
> > ------------------
> > pcs standby node2
> > # everything is live migrated to the other 2 nodes
> >
> > pcs stop node2
> > pcs start node2
> > pcs unstandby node2
> > # node2 is becoming part of the cluster again, since resource stickiness
> > is >0 no resources are migrated at this point.
> >
> > # time of logs is 13:58:07
> > pcs standby node 3
> >
> > # node1 gets fenced after a short while
> >
> > # time of log 14:16:02 and repeats every 15 mins
> > node3 log ?
> > ------------------
> >
> >
> > I looked through the logs but I got no clue what is going wrong, hoping
> > someone may be able to provide a hint.
> >
> > Please find attached
> >
> > output of 'pcs config'
> > logs from all 3 nodes
> > the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/