Re: [ClusterLabs] corosync service stopping
Hi Honza

I would say there is still a certain ambiguity in "shutdown by cfg request", but I would argue that by not using the term "sysadmin" it at least doesn't suggest that the shutdown was triggered by a human. So yes, I think this phrasing is less misleading.

Cheers,
Alex

> On 29.04.2024, at 09:56, Jan Friesse wrote:
>
> Hi,
> I will reply just to the "sysadmin" question:
>
> On 26/04/2024 14:43, Alexander Eastwood via Users wrote:
>> Dear Reid,
> ...
>
>> Why does the corosync log say 'shutdown by sysadmin' when the shutdown was
>> triggered by pacemaker? Isn't this misleading?
>
> This basically means the shutdown was triggered by calling the corosync cfg API. I
> can agree "sysadmin" is misleading. The problem is that the same cfg API call is used by
> corosync-cfgtool, and corosync-cfgtool is used in the systemd service file, where it
> really is probably a sysadmin who initiated the shutdown.
>
> Currently the function where this log message is printed has no information
> about which process initiated the shutdown. It knows only the nodeid.
>
> It would be possible to log some more info (probably also with proc_name) in
> the cfg API function call, but then it is probably a good candidate for the DEBUG
> log level.
>
> So do you think "shutdown by cfg request" would be less misleading?
>
> Regards
> Honza

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
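For readers following the thread: the cfg-API shutdown path Honza describes can also be exercised by hand with corosync-cfgtool. A hedged sketch (flag names as in corosync 3.x; check corosync-cfgtool(8) and your unit file, since details vary by version):

```shell
# Ask the local corosync to shut down via the cfg API -- the same call
# path that produces the "Node 1 was shut down by ..." log line.
corosync-cfgtool -H

# The corosync systemd unit typically stops the daemon through the same
# tool (e.g. ExecStop=/usr/sbin/corosync-cfgtool -H --force), which is
# why, at the point where the message is logged, a "systemctl stop
# corosync" and a pacemakerd-initiated cfg request look identical.
systemctl cat corosync | grep -i ExecStop
```

This is why the log line alone cannot distinguish pacemakerd's pcmkd_shutdown_corosync() from an administrator's stop.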
Re: [ClusterLabs] corosync service stopping
xit) info: pacemaker-execd[1295872] exited with status 0 (OK)
Apr 23 11:06:03.970 testcluster-c1 pacemakerd [1295869] (stop_child) notice: Stopping pacemaker-fenced | sent signal 15 to process 1295871
Apr 23 11:06:03.978 testcluster-c1 pacemaker-fenced [1295871] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:04.010 testcluster-c1 pacemaker-fenced [1295871] (stonith_shutdown) info: Terminating with 0 clients
Apr 23 11:06:04.630 testcluster-c1 pacemaker-fenced [1295871] (cib_connection_destroy) info: Connection to the CIB manager closed
Apr 23 11:06:04.702 testcluster-c1 pacemaker-fenced [1295871] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Apr 23 11:06:04.778 testcluster-c1 pacemaker-fenced [1295871] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Apr 23 11:06:04.842 testcluster-c1 pacemaker-fenced [1295871] (crm_exit) info: Exiting pacemaker-fenced | with status 0
Apr 23 11:06:06.258 testcluster-c1 pacemakerd [1295869] (pcmk_child_exit) info: pacemaker-fenced[1295871] exited with status 0 (OK)
Apr 23 11:06:06.266 testcluster-c1 pacemaker-based [1295870] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:06.266 testcluster-c1 pacemakerd [1295869] (stop_child) notice: Stopping pacemaker-based | sent signal 15 to process 1295870
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based [1295870] (cib_shutdown) info: Disconnected 0 clients
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based [1295870] (cib_shutdown) info: All clients disconnected (0)
Apr 23 11:06:06.282 testcluster-c1 pacemaker-based [1295870] (initiate_exit) info: Sending disconnect notification to 2 peers...
Apr 23 11:06:06.334 testcluster-c1 pacemaker-based [1295870] (cib_process_shutdown_req) info: Peer testcluster-c1 is requesting to shut down
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based [1295870] (cib_process_shutdown_req) info: Peer testcluster-c2 has acknowledged our shutdown request
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based [1295870] (terminate_cib) info: cib_process_shutdown_req: Exiting from mainloop...
Apr 23 11:06:06.350 testcluster-c1 pacemaker-based [1295870] (crm_cluster_disconnect) info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.358 testcluster-c1 pacemaker-based [1295870] (pcmk__corosync_disconnect) notice: Disconnected from Corosync
Apr 23 11:06:06.366 testcluster-c1 pacemaker-based [1295870] (crm_get_peer) info: Created entry 48e34a90-0596-48f1-b2b2-171d1692aac5/0x564d66683e40 for node testcluster-c2/0 (1 total)
Apr 23 11:06:06.406 testcluster-c1 pacemaker-based [1295870] (cib_peer_update_callback) info: No more peers
Apr 23 11:06:06.422 testcluster-c1 pacemaker-based [1295870] (terminate_cib) info: cib_peer_update_callback: Exiting from mainloop...
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based [1295870] (crm_cluster_disconnect) info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based [1295870] (cluster_disconnect_cpg) info: No CPG connection
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based [1295870] (pcmk__corosync_disconnect) notice: Disconnected from Corosync
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based [1295870] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based [1295870] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based [1295870] (qb_ipcs_us_withdraw) info: withdrawing server sockets
Apr 23 11:06:06.458 testcluster-c1 pacemaker-based [1295870] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Apr 23 11:06:09.682 testcluster-c1 pacemaker-based [1295870] (crm_exit) info: Exiting pacemaker-based | with status 0
Apr 23 11:06:10.450 testcluster-c1 pacemakerd [1295869] (pcmk_child_exit) info: pacemaker-based[1295870] exited with status 0 (OK)
Apr 23 11:06:10.458 testcluster-c1 pacemakerd [1295869] (pcmk_shutdown_worker) notice: Shutdown complete
Apr 23 11:06:10.458 testcluster-c1 pacemakerd [1295869] (pcmk_shutdown_worker) notice: Shutting down and staying down after fatal error
Apr 23 11:06:10.458 testcluster-c1 pacemakerd [1295869] (pcmkd_shutdown_corosync) info: Asking Corosync to shut down
Apr 23 11:06:10.530 testcluster-c1 pacemakerd [1295869] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Apr 23 11:06:10.554 testcluster-c1 pacemakerd [1295869] (crm_exit) info: Exiting pacemakerd | with status 100

On 26.04.2024, at 05:54, Reid Wahl wrote:
> Any logs from Pacemaker?
>
> On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users wrote:
>> Hi all,
>> I’m trying to get a better understanding of why our cluster - or speci
[ClusterLabs] corosync service stopping
Hi all,

I’m trying to get a better understanding of why our cluster - or specifically corosync.service - entered a failed state. Here are all of the relevant corosync logs from this event, with the last line showing when I manually started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [CFG ] Node 1 was shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Unloading all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info [QB ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info [QB ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info [QB ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info [QB ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info [QB ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice [SERV ] Service engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET ] host: host: 1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice [MAIN ] Corosync Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice [MAIN ] Corosync Cluster Engine 3.1.6 starting up

The first line suggests a manual shutdown of one of the cluster nodes; however, neither I nor any of my colleagues did this. The ‘sysadmin’ surely must mean a person logging on to the server and running some command, as opposed to a system process?

Then in the 3rd row from the bottom there is the warning “host: host: 1 has no active links”, which is followed by “Corosync Cluster Engine exiting normally”. Does this mean that the reason for the Cluster Engine exiting is the fact that there are no active links?

Finally, I am considering adding a systemd override file for the corosync service with the following content:

[Service]
Restart=on-failure

Is there any reason not to do this? And, given that the process exited normally, would I need to use Restart=always instead?

Many thanks

Alex

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
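For what it's worth, the override being considered can be installed as a systemd drop-in; a sketch using standard drop-in paths. Note the semantics: Restart=on-failure only fires on an unclean exit (non-zero status or signal), so a stop that logs "Corosync Cluster Engine exiting normally" would not be restarted, while Restart=always restarts after any exit except an explicit `systemctl stop`:

```shell
# Drop-in override for corosync.service
# (same effect as running `systemctl edit corosync`)
mkdir -p /etc/systemd/system/corosync.service.d
cat > /etc/systemd/system/corosync.service.d/override.conf <<'EOF'
[Service]
# on-failure: restart only after a non-zero exit status or a signal.
# A clean exit is NOT restarted; Restart=always would restart it, but
# also after every stop that does not go through systemctl.
Restart=on-failure
EOF
systemctl daemon-reload
```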
Re: [ClusterLabs] [EXT] Prevent cluster transition when resource unavailable on both nodes
Hi,

Thanks Ken and Ulrich for your replies. With your suggestions I ended up finding out about ocf:heartbeat:ethmonitor and will try to set this up as an additional resource within our cluster. I can share more information once (if!) I have it working the way I want to.

Cheers,
Alex

> On 07.12.2023, at 08:59, Windl, Ulrich wrote:
>
> Hi!
>
> What about this: Run a ping node for a remote resource to set up some score
> value. If the remote is unreachable, the score will reflect that.
> Then add a rule checking that score, deciding whether to run the virtual IP or
> not.
>
> Regards,
> Ulrich
>
> -----Original Message-----
> From: Users On Behalf Of Alexander Eastwood
> Sent: Wednesday, December 6, 2023 5:56 PM
> To: users@clusterlabs.org
> Subject: [EXT] [ClusterLabs] Prevent cluster transition when resource
> unavailable on both nodes
>
> Hello,
>
> I administrate a Pacemaker cluster consisting of 2 nodes, which are connected
> to each other via ethernet cable to ensure that they are always able to
> communicate with each other. A network switch is also connected to each node
> via ethernet cable and provides external access.
>
> One of the managed resources of the cluster is a virtual IP, which is
> assigned to a physical network interface card and thus depends on the network
> switch being available. The virtual IP is always hosted on the active node.
>
> We had the situation where the network switch lost power or was rebooted; as
> a result both servers reported `NIC Link is Down`. The recover operation on
> the Virtual IP resource then failed repeatedly on the active node, and a
> transition was initiated. Since the other node was also unable to start the
> resource, the cluster was swaying between the 2 nodes until the NIC links
> were up again.
>
> Is there a way to change this behaviour? I am thinking of the following
> sequence of events, but have not been able to find a way to configure this:
>
> 1. active node detects NIC Link is Down, which affects a resource managed by
> the cluster (monitor operation on the resource starts to fail)
> 2. active node checks if the other (passive) node in the cluster would be
> able to start the resource
> 3. if passive node can start the resource, transition all resources to
> passive node
> 4. if passive node is unable to start the resource, then there is nothing to
> be gained by a transition, so no action should be taken
>
> Any pointers or advice will be much appreciated!
>
> Thank you and kind regards,
>
> Alex Eastwood

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
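As a sketch of where the ethmonitor approach could go: ocf:heartbeat:ethmonitor maintains a node attribute (by default named ethmonitor-<interface>) that is 1 while the interface has link, and a location rule can keep the virtual IP off nodes where that attribute is not 1. The resource and interface names below are illustrative, not from this thread, and the attribute name should be verified against the agent's metadata on your system:

```shell
# Monitor link state of eth0 on every node; the agent sets node
# attribute "ethmonitor-eth0" (1 = link up, 0 = link down).
# Names here are examples only.
pcs resource create nic-monitor ocf:heartbeat:ethmonitor \
    interface=eth0 clone

# Ban the virtual IP from any node whose monitored NIC has no link.
# When the switch is down for both nodes, neither node is eligible,
# so the cluster stops bouncing the resource back and forth.
pcs constraint location virtual-ip rule score=-INFINITY \
    ethmonitor-eth0 ne 1
```

With the ban applied symmetrically, the "nothing to be gained by a transition" case resolves itself: the IP simply stays stopped until a link comes back.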
[ClusterLabs] Prevent cluster transition when resource unavailable on both nodes
Hello,

I administrate a Pacemaker cluster consisting of 2 nodes, which are connected to each other via ethernet cable to ensure that they are always able to communicate with each other. A network switch is also connected to each node via ethernet cable and provides external access.

One of the managed resources of the cluster is a virtual IP, which is assigned to a physical network interface card and thus depends on the network switch being available. The virtual IP is always hosted on the active node.

We had the situation where the network switch lost power or was rebooted; as a result both servers reported `NIC Link is Down`. The recover operation on the Virtual IP resource then failed repeatedly on the active node, and a transition was initiated. Since the other node was also unable to start the resource, the cluster was swaying between the 2 nodes until the NIC links were up again.

Is there a way to change this behaviour? I am thinking of the following sequence of events, but have not been able to find a way to configure this:

1. active node detects NIC Link is Down, which affects a resource managed by the cluster (monitor operation on the resource starts to fail)
2. active node checks if the other (passive) node in the cluster would be able to start the resource
3. if passive node can start the resource, transition all resources to passive node
4. if passive node is unable to start the resource, then there is nothing to be gained by a transition, so no action should be taken

Any pointers or advice will be much appreciated!

Thank you and kind regards,

Alex Eastwood

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/