On Fri, Apr 16, 2021 at 6:56 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
>
> On 15.04.2021 23:09, Steffen Vinther Sørensen wrote:
> > On Thu, Apr 15, 2021 at 3:39 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
> >>
> >> On 4/15/21 3:26 PM, Ulrich Windl wrote:
> >>>>>> Steffen Vinther Sørensen <svint...@gmail.com> schrieb am 15.04.2021 um
> >>> 14:56 in Nachricht
> >>> <calhdmbixzoyf-gxg82ont4mgfm6q-_imceuvhypgwky41jj...@mail.gmail.com>:
> >>>> On Thu, Apr 15, 2021 at 2:29 PM Ulrich Windl
> >>>> <ulrich.wi...@rz.uni-regensburg.de> wrote:
> >>>>>>>> Steffen Vinther Sørensen <svint...@gmail.com> schrieb am 15.04.2021 um
> >>>>> 13:10 in Nachricht
> >>>>> <CALhdMBhMQRwmgoWEWuiGMDr7HfVOTTKvW8=nqms2p2e9p8y...@mail.gmail.com>:
> >>>>>> Hi there,
> >>>>>>
> >>>>>> In this 3 node cluster, node03 been offline for a while, and being
> >>>>>> brought up to service. Then a migration of a VirtualDomain is being
> >>>>>> attempted, and node02 is then fenced.
> >>>>>>
> >>>>>> Provided is logs from all 2 nodes, and the 'pcs config' as well as a
> >>>>>> bzcatted pe-warn. Anyone with an idea of why the node was fenced ? Is
> >>>>>> it because of the failed ipmi monitor warning ?
> >>>>> After a short glace it looks as if the network traffic used for VM
> >>>>> migration killed the corosync (or other) communication.
> >>>>>
> >>>> May I ask what part is making you think so ?
> >>> The part that I saw no reason for an intended fencing.
> >> And it looks like node02 is being cut off from all
> >> networking-communication - both corosync & ipmi.
> >> May really be the networking-load although I would
> >> rather bet on something more systematic like a
> >> Mac/IP-conflict with the VM or something.
> >> I see you are having libvirtd under cluster-control.
> >> Maybe bringing up the network-topology destroys the
> >> connection between the nodes.
> >> Has the cluster been working with the 3 nodes before?
> >>
> >> Klaus
> >
> > Hi Klaus
> >
> > Yes it has been working before with all 3 nodes and migrations back
> > and forth, but a few more VirtualDomains have been deployed since the
> > last migration test.
> >
> > It happens very fast, almost immediately after migration is starting.
> > Could it be that some timeout values should be adjusted ?
> > I just don't have any idea where to start looking, as to me there is
> > nothing obviously suspicious found in the logs.
>
> I would look at performance stats, may be node02 was overloaded and
> could not answer in time. Although standard sar stats are collected
> every 15 minutes which is usually too coarse for it.
>
> Migration could stress network. Talk with your network support, any
> errors around this time?
I see no network errors around that time when checking the alert e-mails and syslogs from the network equipment.

Last night I brought the previously fenced node02 back up with 'pcs cluster start' and initiated a migration. The same thing happened: node03 was fenced almost immediately. I then brought node03 back up and left the cluster alone for the night. This morning I did several migrations successfully. So it might be something that needs more time to come up fully, maybe the cluster-managed libvirtd network components.

I have Prometheus scraping node_exporter on all 3 nodes, so I can dig into the network traffic around the incidents. For the 2 failing incidents, the traffic rises on migration to a stable 250Mb/s or 600Mb/s for a couple of minutes. For successful migrations, the traffic always goes to 1000Mb/s, which is the maximum for a single connection (the nodes have 4x1000Mb NICs bonded). Apart from that, there is very little traffic around any of the incidents.

/Steffen
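PS: Two quick things I plan to check on the nodes, in case the migration stream and corosync share the same links. This is only a rough sketch; it assumes the bond is called bond0 and that the nodes run corosync 2.x, so adjust the names to the actual setup:

  # Bonding mode and transmit hash policy: in anything but balance-rr a single
  # TCP stream stays on one 1Gb slave, which would match the 1000Mb/s ceiling
  # seen for a single migration.
  grep -i -e 'bonding mode' -e 'hash policy' /proc/net/bonding/bond0

  # Effective totem token timeout corosync is currently running with.
  corosync-cmapctl | grep -i 'totem.token'

If one slave saturates while the corosync heartbeats ride the same link, then raising the token timeout, moving corosync to a dedicated interface, or capping the migration bandwidth would seem to be the obvious knobs to look at.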