Any logs from Pacemaker?

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users
<users@clusterlabs.org> wrote:
>
> Hi all,
>
> I’m trying to get a better understanding of why our cluster - or
> specifically corosync.service - entered a failed state. Here are all of
> the relevant corosync logs from this event, with the last line showing
> when I manually started the service again:
>
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG ] Node 1 was shut down by sysadmin
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Unloading all Corosync service engines.
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync vote quorum service v1.0
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync configuration map access
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync configuration service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync profile loading service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync resource monitoring service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV ] Service engine unloaded: corosync watchdog service
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET ] host: host: 1 has no active links
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN ] Corosync Cluster Engine exiting normally
> Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN ] Corosync Cluster Engine 3.1.6 starting up
>
> The first line suggests a manual shutdown of one of the cluster nodes;
> however, neither I nor any of my colleagues did this. The ‘sysadmin’
> surely must mean a person logging on to the server and running some
> command, as opposed to a system process?
>
> Then, in the third row from the bottom, there is the warning “host:
> host: 1 has no active links”, which is followed by “Corosync Cluster
> Engine exiting normally”. Does this mean that the Cluster Engine exited
> because there are no active links?
>
> Finally, I am considering adding a systemd override file for the
> corosync service with the following content:
>
> [Service]
> Restart=on-failure
>
> Is there any reason not to do this? And, given that the process exited
> normally, would I need to use Restart=always instead?
>
> Many thanks
>
> Alex
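[Editor's note: the override the quoted question describes would be a systemd drop-in. A minimal sketch, assuming the standard drop-in location that `systemctl edit corosync` creates (the `RestartSec` value is an added assumption, not part of the original question):]

```ini
# /etc/systemd/system/corosync.service.d/override.conf
# (the path `systemctl edit corosync` would create; run
# `systemctl daemon-reload` afterwards if editing the file by hand)
[Service]
Restart=on-failure
# Assumed delay before restarting; tune to taste.
RestartSec=5
```

[As the question anticipates, `Restart=on-failure` only fires on an unclean exit code, a fatal signal, or a watchdog/timeout abort; a process that exits with status 0, as "exiting normally" in the log above implies, would not be restarted. `Restart=always` would also cover clean exits that systemd did not itself initiate, though note that neither setting restarts a unit stopped via `systemctl stop`.]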
--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/