Hello Ken,

I think I have resolved the problem on my own.

Yes, right after boot, corosync fails to come up. The problem appears to be related to name resolution. I ran corosync in the foreground under strace: corosync froze, and the strace output was suspicious, with many name-resolution-like calls.
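
For reference, this is the kind of command I mean (the exact strace filter is illustrative, not necessarily what I used):

sudo strace -f -e trace=network corosync -f

(strace's -f follows child processes/threads; corosync's -f keeps it in the foreground.)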

In my failing cluster, I am running a containerized BIND9 for the regular name resolution services. Both nodes run systemd-resolved for local name resolution. Below are the relevant directives of /etc/systemd/resolved.conf:

DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=

10.1.5.30/29 is the virtual IP address on which BIND9 can be queried on the nodes. Both the VIP and the BIND9 container are managed by pacemaker, so right after a reboot the node does NOT have the VIP and there is NO container running.
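
For context, the resource pair looks roughly like this (resource names and container image are illustrative, not my exact configuration):

pcs resource create dns-vip ocf:heartbeat:IPaddr2 ip=10.1.5.30 cidr_netmask=29 op monitor interval=30s
pcs resource create bind9 ocf:heartbeat:podman image=local/bind9 name=bind9
pcs constraint colocation add bind9 with dns-vip
pcs constraint order dns-vip then bind9

So until pacemaker has started these resources, 10.1.5.30 simply does not exist anywhere.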

When I changed the directives as follows:

#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=

corosync runs perfectly and a successful cluster launch follows. 1.2.3.4 is a bogus address, and the node does NOT have a default route before cluster launch, so obviously it does NOT receive any replies to its name queries while corosync is coming up. However, both nodes do have valid addresses after a reboot, 10.1.5.25/29 and 10.1.5.26/29, i.e. the 10.1.5.24/29 subnet is locally attached on both nodes.
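
For completeness, I apply and verify a resolved.conf change with the usual commands, roughly:

sudo systemctl restart systemd-resolved
resolvectl status

After the change, resolvectl status should list 1.2.3.4 as the global DNS server.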

The last discovery to mention is that I monitored LOCAL name resolution while corosync was starting ("sudo resolvectl monitor"). The monitor immediately displayed PTR queries for ALL LOCAL IP addresses of the node.
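
A PTR query is just a reverse lookup, so the same kind of query can be reproduced by hand, for example (assuming dig from dnsutils is available):

dig -x 10.1.5.25 @10.1.5.30

With the VIP absent, such a query simply gets no answer, which fits the freeze I was seeing.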

Based on the above, my conclusion is that something goes bad with name resolution when the configured DNS server is a non-existent VIP address. In my first message, I mentioned that I was only able to recover corosync by REINSTALLING it from the repo. In order to reinstall, I was manually setting a default route and a name server address (8.8.8.8) so that "apt reinstall corosync" would actually work. Hence, I was unintentionally configuring a reachable DNS server for systemd-resolved. So it was NOT the reinstallation of corosync that fixed things, but letting systemd-resolved use a non-local name server address.
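
Those manual steps were along these lines (gateway and interface name are placeholders, not my real values):

sudo ip route add default via <gateway-ip> dev <uplink-iface>
sudo resolvectl dns <uplink-iface> 8.8.8.8
sudo apt reinstall corosync

resolvectl dns sets a per-link DNS server at runtime, which is why name resolution suddenly started working again as a side effect.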

I have been using corosync/pacemaker in production for a couple of years, probably since Ubuntu Server 21.10, and never encountered such a problem until now. As a workaround I wrote an Ansible playbook to toggle systemd-resolved's DNS directive; however, I think this glitch SHOULD NOT exist.
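
The playbook essentially does what these two commands do before the cluster starts (and the reverse once BIND9 is back up); the sed form here is only an illustration of the idea:

sudo sed -i -e 's/^DNS=10.1.5.30/#DNS=10.1.5.30/' -e 's/^#DNS=1.2.3.4/DNS=1.2.3.4/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved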

I would be glad to receive comments on the above.

Regards,


On 8/20/24 21:55, Ken Gaillot wrote:
> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>> [Resending the below due to message format problem]
>>
>> Dear List,
>>
>> I have been running two different 3-node clusters for some time. I am
>> having a fatal problem with corosync: After a node failure, rebooted
>> node does NOT start corosync.
>>
>> Clusters;
>>
>>    * All nodes are running Ubuntu Server 24.04
>>    * corosync is 3.1.7
>>    * corosync-qdevice is 3.0.3
>>    * pacemaker is 2.1.6
>>    * The third node at both clusters is a quorum device. Cluster is on
>>      ffsplit algorithm.
>>    * All nodes are baremetal & attached to a dedicated kronosnet network.
>>    * STONITH is enabled in one of the clusters and disabled for the other.
>>
>> corosync & pacemaker service starts (systemd) are disabled. I am
>> starting any cluster with the command pcs cluster start.
>>
>> corosync NEVER starts AFTER a node failure (node is rebooted). There is
>> nothing in /var/log/corosync/corosync.log, service freezes as:
>
> Do you mean that the first time you run "pcs cluster start" after a
> node reboot, corosync does not come up completely?
>
> Try adding "debug: on" to the logging section of
> /etc/corosync/corosync.conf
>
>> Aug 01 12:54:56 [3193] charon corosync notice  [MAIN  ] Corosync Cluster
>> Engine 3.1.7 starting up
>> Aug 01 12:54:56 [3193] charon corosync info    [MAIN  ] Corosync
>> built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim
>> nozzle snmp pie relro bindnow
>>
>> corosync never starts kronosnet. I checked kronosnet interfaces, all OK,
>> there is IP connectivity in between. If I do corosync -t, it is the same
>> freeze.
>>
>> I could ONLY manage to start corosync by reinstalling it: apt reinstall
>> corosync ; pcs cluster start.
>>
>> The above issue repeated itself at least 5-6 times. I do NOT see
>> anything in syslog either. I will be glad if you lead me on how to solve
>> this.
>>
>> Thanks,
>>
>> Murat
