Re: [ClusterLabs] corosync service stopping

2024-04-30 Thread Alexander Eastwood via Users
Hi Honza

I would say there is still a certain ambiguity in “shutdown by cfg request”, 
but since it avoids the term “sysadmin” it at least doesn’t suggest that the 
shutdown was triggered by a human. So yes, I think this phrasing is less 
misleading.
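
For anyone reading this in the archives later: as I understand Honza’s 
explanation, the same cfg shutdown request can also be issued by hand, roughly 
like this (a sketch from memory; exact flags differ between corosync versions, 
and newer service files seem to pass --force as well):

    corosync-cfgtool -H

which is presumably why, at the point the message is logged, corosync cannot 
tell pacemakerd apart from an administrator running the tool.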

Cheers,

Alex

> On 29.04.2024, at 09:56, Jan Friesse  wrote:
> 
> Hi,
> I will reply just to "sysadmin" question:
> 
> On 26/04/2024 14:43, Alexander Eastwood via Users wrote:
>> Dear Reid,
> ...
> 
>> Why does the corosync log say ’shutdown by sysadmin’ when the shutdown was 
>> triggered by pacemaker? Isn’t this misleading?
> 
> This basically means the shutdown was triggered by calling the corosync cfg 
> API. I can agree "sysadmin" is misleading. The problem is that the same cfg 
> API call is used by corosync-cfgtool, and corosync-cfgtool is used in the 
> systemd service file, where it really is probably a sysadmin who initiated 
> the shutdown.
> 
> Currently the function where this log message is printed has no information 
> about which process initiated the shutdown. It knows only the nodeid.
> 
> It would be possible to log some more info (probably also with proc_name) in 
> the cfg API function call, but then it is probably a good candidate for the 
> DEBUG log level.
> 
> So do you think "shutdown by cfg request" would be less misleading?
> 
> Regards
>  Honza
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync service stopping

2024-04-26 Thread Alexander Eastwood via Users
xit)      info: pacemaker-execd[1295872] exited with status 0 (OK)
Apr 23 11:06:03.970 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-fenced | sent signal 15 to process 1295871
Apr 23 11:06:03.978 testcluster-c1 pacemaker-fenced    [1295871] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:04.010 testcluster-c1 pacemaker-fenced    [1295871] (stonith_shutdown)     info: Terminating with 0 clients
Apr 23 11:06:04.630 testcluster-c1 pacemaker-fenced    [1295871] (cib_connection_destroy)       info: Connection to the CIB manager closed
Apr 23 11:06:04.702 testcluster-c1 pacemaker-fenced    [1295871] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:04.778 testcluster-c1 pacemaker-fenced    [1295871] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:04.842 testcluster-c1 pacemaker-fenced    [1295871] (crm_exit)     info: Exiting pacemaker-fenced | with status 0
Apr 23 11:06:06.258 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-fenced[1295871] exited with status 0 (OK)
Apr 23 11:06:06.266 testcluster-c1 pacemaker-based     [1295870] (crm_signal_dispatch)  notice: Caught 'Terminated' signal | 15 (invoking handler)
Apr 23 11:06:06.266 testcluster-c1 pacemakerd          [1295869] (stop_child)   notice: Stopping pacemaker-based | sent signal 15 to process 1295870
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: Disconnected 0 clients
Apr 23 11:06:06.274 testcluster-c1 pacemaker-based     [1295870] (cib_shutdown)         info: All clients disconnected (0)
Apr 23 11:06:06.282 testcluster-c1 pacemaker-based     [1295870] (initiate_exit)        info: Sending disconnect notification to 2 peers...
Apr 23 11:06:06.334 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c1 is requesting to shut down
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (cib_process_shutdown_req)     info: Peer testcluster-c2 has acknowledged our shutdown request
Apr 23 11:06:06.346 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_process_shutdown_req: Exiting from mainloop...
Apr 23 11:06:06.350 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.358 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.366 testcluster-c1 pacemaker-based     [1295870] (crm_get_peer)         info: Created entry 48e34a90-0596-48f1-b2b2-171d1692aac5/0x564d66683e40 for node testcluster-c2/0 (1 total)
Apr 23 11:06:06.406 testcluster-c1 pacemaker-based     [1295870] (cib_peer_update_callback)     info: No more peers
Apr 23 11:06:06.422 testcluster-c1 pacemaker-based     [1295870] (terminate_cib)        info: cib_peer_update_callback: Exiting from mainloop...
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (crm_cluster_disconnect)       info: Disconnecting from corosync cluster infrastructure
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (cluster_disconnect_cpg)       info: No CPG connection
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (pcmk__corosync_disconnect)    notice: Disconnected from Corosync
Apr 23 11:06:06.426 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.442 testcluster-c1 pacemaker-based     [1295870] (qb_ipcs_us_withdraw)  info: withdrawing server sockets
Apr 23 11:06:06.458 testcluster-c1 pacemaker-based     [1295870] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:09.682 testcluster-c1 pacemaker-based     [1295870] (crm_exit)     info: Exiting pacemaker-based | with status 0
Apr 23 11:06:10.450 testcluster-c1 pacemakerd          [1295869] (pcmk_child_exit)      info: pacemaker-based[1295870] exited with status 0 (OK)
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutdown complete
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmk_shutdown_worker)         notice: Shutting down and staying down after fatal error
Apr 23 11:06:10.458 testcluster-c1 pacemakerd          [1295869] (pcmkd_shutdown_corosync)      info: Asking Corosync to shut down
Apr 23 11:06:10.530 testcluster-c1 pacemakerd          [1295869] (crm_xml_cleanup)      info: Cleaning up memory from libxml2
Apr 23 11:06:10.554 testcluster-c1 pacemakerd          [1295869] (crm_exit)     info: Exiting pacemakerd | with status 100

On 26.04.2024, at 05:54, Reid Wahl  wrote:
Any logs from Pacemaker?

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users wrote:
Hi all,
I’m trying to get a better understanding of why our cluster - or speci

[ClusterLabs] corosync service stopping

2024-04-25 Thread Alexander Eastwood via Users
Hi all,

I’m trying to get a better understanding of why our cluster - or specifically 
corosync.service - entered a failed state. Here are all of the relevant 
corosync logs from this event, with the last line showing when I manually 
started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was 
shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading 
all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing 
server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing 
server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing 
server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing 
server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing 
server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service 
engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 
1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 
1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync 
Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync 
Cluster Engine 3.1.6 starting up

The first line suggests that one of the cluster nodes was shut down manually; 
however, neither I nor any of my colleagues did this. Surely ‘sysadmin’ must 
mean a person logging on to the server and running some command, as opposed to 
a system process?

Then, in the third line from the bottom, there is the warning “host: host: 1 
has no active links”, followed by “Corosync Cluster Engine exiting normally”. 
Does this mean that the Cluster Engine exited because there were no active 
links?

Finally, I am considering adding a systemd override file for the corosync 
service with the following content:

[Service]
Restart=on-failure

Is there any reason not to do this? And, given that the process exited 
normally, would I need to use Restart=always instead?
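
(In case it matters for the answer: I would install this as a drop-in rather 
than by editing the unit file itself, roughly like this, assuming systemctl 
edit is the right mechanism here:

    # systemctl edit corosync
    [Service]
    Restart=on-failure
    # systemctl daemon-reload

i.e. an override.conf under /etc/systemd/system/corosync.service.d/.)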

Many thanks

Alex
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [EXT] Prevent cluster transition when resource unavailable on both nodes

2023-12-11 Thread Alexander Eastwood
Hi,

Thanks Ken and Ulrich for your replies. Your suggestions led me to 
ocf:heartbeat:ethmonitor, which I will try to set up as an additional resource 
within our cluster.
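
For the archives, the rough shape of what I have in mind is below. This is an 
untested sketch; the pcs syntax, the interface name eth0 and the resource name 
VirtualIP are placeholders for our actual setup:

    pcs resource create ethmon ocf:heartbeat:ethmonitor interface=eth0 clone
    pcs constraint location VirtualIP rule score=-INFINITY ethmonitor-eth0 ne 1

The idea being that the cloned ethmonitor maintains a node attribute per 
monitored interface, and the location rule keeps the virtual IP off any node 
where that attribute says the link is down.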

I can share more information once (if!) I have it working the way I want to.

Cheers,

Alex

> On 07.12.2023, at 08:59, Windl, Ulrich  wrote:
> 
> Hi!
> 
> What about this: run a ping resource against a remote node to set up a score 
> value. If the remote is unreachable, the score will reflect that.
> Then add a rule checking that score, deciding whether to run the virtual IP 
> or not.
> 
> Regards,
> Ulrich
> 
> -Original Message-
> From: Users  On Behalf Of Alexander Eastwood
> Sent: Wednesday, December 6, 2023 5:56 PM
> To: users@clusterlabs.org
> Subject: [EXT] [ClusterLabs] Prevent cluster transition when resource 
> unavailable on both nodes
> 
> Hello, 
> 
> I administrate a Pacemaker cluster consisting of 2 nodes, which are connected 
> to each other via ethernet cable to ensure that they are always able to 
> communicate with each other. A network switch is also connected to each node 
> via ethernet cable and provides external access.
> 
> One of the managed resources of the cluster is a virtual IP, which is 
> assigned to a physical network interface card and thus depends on the network 
> switch being available. The virtual IP is always hosted on the active node.
> 
> We had the situation where the network switch lost power or was rebooted; as 
> a result, both servers reported `NIC Link is Down`. The recover operation on 
> the Virtual IP resource then failed repeatedly on the active node, and a 
> transition was initiated. Since the other node was also unable to start the 
> resource, the cluster kept bouncing between the 2 nodes until the NIC links 
> were up again.
> 
> Is there a way to change this behaviour? I am thinking of the following 
> sequence of events, but have not been able to find a way to configure this:
> 
> 1. active node detects NIC Link is Down, which affects a resource managed by 
> the cluster (monitor operation on the resource starts to fail)
> 2. active node checks if the other (passive) node in the cluster would be 
> able to start the resource
> 3. if passive node can start the resource, transition all resources to 
> passive node
> 4. if passive node is unable to start the resource, then there is nothing to 
> be gained by a transition, so no action should be taken
> 
> Any pointers or advice will be much appreciated!
> 
> Thank you and kind regards,
> 
> Alex Eastwood
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Prevent cluster transition when resource unavailable on both nodes

2023-12-06 Thread Alexander Eastwood
Hello, 

I administrate a Pacemaker cluster consisting of 2 nodes, which are connected 
to each other via ethernet cable to ensure that they are always able to 
communicate with each other. A network switch is also connected to each node 
via ethernet cable and provides external access.

One of the managed resources of the cluster is a virtual IP, which is assigned 
to a physical network interface card and thus depends on the network switch 
being available. The virtual IP is always hosted on the active node.

We had the situation where the network switch lost power or was rebooted; as a 
result, both servers reported `NIC Link is Down`. The recover operation on the 
Virtual IP resource then failed repeatedly on the active node, and a transition 
was initiated. Since the other node was also unable to start the resource, the 
cluster kept bouncing between the 2 nodes until the NIC links were up again.

Is there a way to change this behaviour? I am thinking of the following 
sequence of events, but have not been able to find a way to configure this:

 1. active node detects NIC Link is Down, which affects a resource managed by 
the cluster (monitor operation on the resource starts to fail)
 2. active node checks if the other (passive) node in the cluster would be able 
to start the resource
 3. if passive node can start the resource, transition all resources to passive 
node
 4. if passive node is unable to start the resource, then there is nothing to 
be gained by a transition, so no action should be taken

Any pointers or advice will be much appreciated!

Thank you and kind regards,

Alex Eastwood
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/