Re: [ClusterLabs] corosync service stopping

2024-04-30 Thread Alexander Eastwood via Users
Hi Honza

I would say there is still a certain ambiguity in “shutdown by cfg request”, 
but since it avoids the term “sysadmin” it at least doesn’t suggest that the 
shutdown was triggered by a human. So yes, I think this phrasing is less 
misleading.
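
For anyone following the thread, this is how I now understand the call path, 
sketched against the public CFG API in corosync/cfg.h (error handling trimmed; 
which shutdown flag pacemakerd actually passes is an assumption on my part, 
the flag below is just the polite request variant):

/* cfg_shutdown_sketch.c - minimal sketch of a client asking corosync to shut
 * down via the CFG API; build roughly with: gcc cfg_shutdown_sketch.c -lcfg */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cfg.h>

int main(void)
{
    corosync_cfg_handle_t handle;
    cs_error_t rc;

    /* connect to the local corosync cfg service (no callbacks needed here) */
    rc = corosync_cfg_initialize(&handle, NULL);
    if (rc != CS_OK) {
        fprintf(stderr, "corosync_cfg_initialize failed: %d\n", (int) rc);
        return 1;
    }

    /* This is the call that corosync ends up logging as
     * "Node N was shut down by sysadmin", regardless of which process made it.
     * REQUEST asks nicely; REGARDLESS/IMMEDIATE flags also exist. Which one
     * pacemakerd uses is an assumption on my part. */
    rc = corosync_cfg_try_shutdown(handle, COROSYNC_CFG_SHUTDOWN_FLAG_REQUEST);
    if (rc != CS_OK) {
        fprintf(stderr, "corosync_cfg_try_shutdown failed: %d\n", (int) rc);
    }

    corosync_cfg_finalize(handle);
    return (rc == CS_OK) ? 0 : 1;
}

As you describe below, corosync-cfgtool and pacemakerd both go through this 
same API call, so the log line alone cannot tell us who asked.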

Cheers,

Alex

> On 29.04.2024, at 09:56, Jan Friesse  wrote:
> 
> Hi,
> I will reply just to "sysadmin" question:
> 
> On 26/04/2024 14:43, Alexander Eastwood via Users wrote:
>> Dear Reid,
> ...
> 
>> Why does the corosync log say ‘shutdown by sysadmin’ when the shutdown was 
>> triggered by pacemaker? Isn’t this misleading?
> 
> This basically means the shutdown was triggered by a call to the corosync cfg 
> API. I agree "sysadmin" is misleading. The problem is that the same cfg API 
> call is also made by corosync-cfgtool, which is invoked from the systemd 
> service file, and there it really is most likely a sysadmin who initiated the 
> shutdown.
> 
> Currently the function that prints this log message has no information about 
> which process initiated the shutdown; it only knows the nodeid.
> 
> It would be possible to log more information (probably including the 
> proc_name) in the cfg API function call, but that is probably a better 
> candidate for the DEBUG log level.
> 
> So do you think "shutdown by cfg request" would be less misleading?
> 
> Regards
>  Honza
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync service stopping

2024-04-26 Thread Alexander Eastwood via Users
xit)      info: pacemaker-execd[1295872] exited with status 0 (OK)
Apr 23 11:06:03.970 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-fenced | sent signal 15 to process 1295871
Apr 23 11:06:03.978 testcluster-c1 pacemaker-fenced    [1295871] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:04.010 testcluster-c1 pacemaker-fenced    [1295871] (stonith_shutdown)     info: Terminating with 0 clients
Apr 23 11:06:04.630 testcluster-c1 pacemaker-fenced    [1295871] (cib_connection_destroy)       info: Connection to the CIB manager closed
Apr 23 11:06:04.702 testcluster-c1 pacemaker-fenced    [1295871] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:04.778 testcluster-c1 pacemaker-fenced    [1295871] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:04.842 testcluster-c1 pacemaker-fenced    [1295871] (crm_exit)     info: Exiting pacemaker-fenced | with status 0
Apr 23 11:06:06.258 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-fenced[1295871] exited with status 0 (OK)
Apr 23 11:06:06.266 testcluster-c1 pacemaker-based     [1295870] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:06.266 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-based | sent signal 15 to process 1295870
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: Disconnected 0 clients
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: All clients disconnected (0)
Apr 23 11:06:06.282 testcluster-c1 pacemaker-based     [1295870] (initiate_exit)        info: Sending disconnect notification to 2 peers...
Apr 23 11:06:06.334 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c1 is requesting to shut down
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c2 has acknowledged our shutdown request
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_process_shutdown_req: Exiting from mainloop...
Apr 23 11:06:06.350 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.358 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.366 testcluster-c1 pacemaker-based     [1295870] (crm_get_peer)         info: Created entry 48e34a90-0596-48f1-b2b2-171d1692aac5/0x564d66683e40 for node testcluster-c2/0 (1 total)
Apr 23 11:06:06.406 testcluster-c1 pacemaker-based     [1295870] (cib_peer_update_callback)     info: No more peers
Apr 23 11:06:06.422 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_peer_update_callback: Exiting from mainloop...
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (cluster_disconnect_cpg)       info: No CPG connection
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.458 testcluster-c1 pacemaker-based     [1295870] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:09.682 testcluster-c1 pacemaker-based     [1295870] (crm_exit)     info: Exiting pacemaker-based | with status 0
Apr 23 11:06:10.450 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-based[1295870] exited with status 0 (OK)
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutdown complete
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutting down and staying down after fatal error
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmkd_shutdown_corosync)      info: Asking Corosync to shut down
Apr 23 11:06:10.530 testcluster-c1 pacemakerd          [1295869] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:10.554 testcluster-c1 pacemakerd          [1295869] (crm_exit)     info: Exiting pacemakerd | with status 100

On 26.04.2024, at 05:54, Reid Wahl  wrote:

Any logs from Pacemaker?

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users wrote:

Hi all,

I’m trying to get a better understanding of why our cluster - or speci

[ClusterLabs] corosync service stopping

2024-04-25 Thread Alexander Eastwood via Users
Hi all,

I’m trying to get a better understanding of why our cluster - or specifically 
corosync.service - entered a failed state. Here are all of the relevant 
corosync logs from this event, with the last line showing when I manually 
started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.6 starting up

The first line suggests that one of the cluster nodes was shut down manually, 
but neither I nor any of my colleagues did this. Surely ‘sysadmin’ must mean a 
person logging on to the server and running some command, as opposed to a 
system process?

Then, in the 3rd row from the bottom, there is the warning “host: host: 1 has 
no active links”, which is followed by “Corosync Cluster Engine exiting 
normally”. Does this mean that the Cluster Engine exited because there were no 
active links?

Finally, I am considering adding a systemd override file for the corosync 
service with the following content:

[Service]
Restart=on-failure

Is there any reason not to do this? And, given that the process exited 
normally, would I need to use Restart=always instead?
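
For concreteness, this is the drop-in I have in mind, created with 'systemctl 
edit corosync' (which, as far as I understand, writes it to 
/etc/systemd/system/corosync.service.d/override.conf):

# /etc/systemd/system/corosync.service.d/override.conf
[Service]
# My reading of systemd.service(5): Restart=on-failure only restarts the unit
# after an unclean exit (non-zero exit code, signal, timeout, watchdog), so a
# "Corosync Cluster Engine exiting normally" shutdown would not trigger it.
# Restart=always would restart the service even after a clean exit.
Restart=on-failure
# optional pause between restart attempts
RestartSec=5

If the file is created by hand instead of via systemctl edit, a 'systemctl 
daemon-reload' is needed afterwards.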

Many thanks

Alex
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/