On Mon, Jan 4, 2021 at 4:22 PM Ulrich Windl
<ulrich.wi...@rz.uni-regensburg.de> wrote:
>
> >>> Steffen Vinther Sørensen <svint...@gmail.com> wrote on 04.01.2021 at 16:08
> in message <CALhdMBjMMHRF3ENE+=uHty7Lb9vku0o1a6+izpm3zpiM=re...@mail.gmail.com>:
> > Hi all,
> >
> > I am trying to stabilize a 3-node CentOS 7 cluster for production use,
> > with VirtualDomain and GFS2 resources. However, the following use case
> > ends up with node1 fenced and some VirtualDomains in FAILED state:
> >
> > ------------------
> > pcs standby node2
> > # everything is live migrated to the other 2 nodes
> >
> > pcs stop node2
> > pcs start node2
> > pcs unstandby node2
> > # node2 becomes part of the cluster again; since resource stickiness
> > # is >0, no resources are migrated at this point
> >
> > # time of logs is 13:58:07
> > pcs standby node3
> >
> > # node1 gets fenced after a short while
> >
> > # time of log is 14:16:02, and it repeats every 15 mins
>
> node3 log?
>
> > ------------------
> >
> > I looked through the logs but got no clue about what is going wrong;
> > hoping someone may be able to provide a hint.
>
> Next time, also indicate which node is the DC; it helps to pick the right log ;-)
> Quite a lot of systemd and SSH connections, IMHO...
> At 13:58:27 node1 (kvm03-node01.avigol-gcs.dk) seems gone.
> Thus (I guess):
>
> Jan 04 13:59:03 kvm03-node02 stonith-ng[28902]: notice: Requesting that kvm03-node03.avigol-gcs.dk perform 'reboot' action targeting kvm03-node01.avigol-gcs.dk
> Jan 04 13:59:22 kvm03-node02 crmd[28906]: notice: Peer kvm03-node01.avigol-gcs.dk was terminated (reboot) by kvm03-node03.avigol-gcs.dk on behalf of stonith-api.33494: OK
>
> Maybe your network was flooded during migration of the VMs?:
>
> Jan 04 13:58:33 kvm03-node03 corosync[37794]: [TOTEM ] Retransmit List: 1 2
>
> You can limit the number of simultaneous migrations, BTW.
>
> Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now member
> Jan 04 13:58:33 kvm03-node03 cib[37814]: notice: Node kvm03-node01.avigol-gcs.dk state is now lost
> Jan 04 13:58:33 kvm03-node03 crmd[37819]: warning: No reason to expect node 1 to be down
>
> The above explains the fencing.
>
> Regards,
> Ulrich
>
> > Please find attached:
> >
> > output of 'pcs config'
> > logs from all 3 nodes
> > the bzcats of pe-error-24.bz2 and pe-error-25.bz2 from node3
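For reference, the DC that Ulrich asks about is printed near the top of the
normal cluster status output, so something along these lines should identify
it (the grep pattern is only an illustration; the exact wording of the status
line can vary between Pacemaker versions):

------------------
# the "Current DC:" line names the node currently acting as DC
pcs status | grep "Current DC"

# the same information is available from crm_mon
crm_mon -1 | grep "Current DC"
------------------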
Setting the cluster property migration-limit=2 seemed to help. Thank you
for the advice, and sorry for the excessive logs.

/Steffen

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
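For anyone finding this thread later, the property can be set and checked
with pcs roughly as below (this is the pcs syntax as shipped with CentOS 7;
details may differ in newer releases, and the value 2 is simply what worked
here):

------------------
# allow at most 2 live migrations to run in parallel per node
# (the default, -1, places no limit)
pcs property set migration-limit=2

# check the value that is now in effect
pcs property show migration-limit
------------------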