Hi all,

I’m trying to get a better understanding of why our cluster - or, more
specifically, corosync.service - entered a failed state. Here are the relevant
corosync log entries from this event; the last line shows when I manually
started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.6 starting up

The first line suggests a manual shutdown of one of the cluster nodes, but
neither I nor any of my colleagues did this. Does ‘sysadmin’ here necessarily
mean a person logging on to the server and running some command, as opposed
to a system process?
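
For context, my understanding is that this message is logged when corosync
receives a shutdown request over its CFG interface (e.g. via something like
corosync-cfgtool -H, or a tool that calls it), so I have been trying to trace
who or what issued the request, roughly along these lines. This assumes
journald and a pcs-managed cluster; the year placeholder and the pcsd log
path are from our setup and may need adjusting:

# Who was logged in around the time of the shutdown?
last -F | grep 'Apr 23'

# Anything in the journal around that window? (fill in the year)
journalctl --since "YYYY-04-23 11:00:00" --until "YYYY-04-23 11:10:00"

# If the stop came through pcs, its daemon log may show the request
# (path on our RHEL-style install; may differ elsewhere):
grep -i stop /var/log/pcsd/pcsd.log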

Then, in the third line from the bottom, there is the warning “host: host: 1
has no active links”, followed by “Corosync Cluster Engine exiting normally”.
Does this mean that the Cluster Engine exited because there were no active
links?

Finally, I am considering adding a systemd override file for the corosync 
service with the following content:

[Service]
Restart=on-failure
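
If the override is sensible, I would apply it roughly like this (just a
sketch assuming the standard drop-in location, not something we run yet):

# Let systemd create the drop-in at
# /etc/systemd/system/corosync.service.d/override.conf:
systemctl edit corosync

# ...or write the file by hand and reload unit definitions:
mkdir -p /etc/systemd/system/corosync.service.d
printf '[Service]\nRestart=on-failure\n' \
  > /etc/systemd/system/corosync.service.d/override.conf
systemctl daemon-reload

# As I understand it, Restart=on-failure only fires on unclean exits
# (non-zero exit code, signal, timeout), hence my question below.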

Is there any reason not to do this? And, given that the process exited 
normally, would I need to use Restart=always instead?

Many thanks

Alex
