Hello Ken,
I think I have resolved the problem on my own.
Yes, right after boot, corosync fails to come up. The problem appears to
be related to name resolution. I ran corosync in the foreground under
strace: corosync froze, and the strace output was suspicious, full of
name-resolution-like calls.
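For reference, this is roughly how I traced it; the exact strace options
below are illustrative, not necessarily the ones I typed:

  sudo strace -f -o /tmp/corosync.trace corosync -f
  grep -E 'connect|sendto|recvmsg' /tmp/corosync.trace

corosync -f keeps the daemon in the foreground, and the trace file is
where the resolver traffic shows up.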
In my failing cluster, I am running a containerized BIND9 for regular
name resolution services. Both nodes run systemd-resolved for their own
local name resolution. The relevant directives of resolved.conf are:
DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=
10.1.5.30/29 is the virtual IP address on the nodes at which BIND9 can be
queried. Both the VIP and the BIND9 container are managed by pacemaker, so
right after a reboot the node does NOT hold the VIP and NO container is
running.
When I changed the directives to:
#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=
corosync runs perfectly and a successful cluster launch follows. 1.2.3.4
is a bogus address, and the node does NOT have a default route before the
cluster launches, so it obviously does NOT receive any replies to its name
queries while corosync is coming up. However, after a reboot both nodes do
have valid addresses, 10.1.5.25/29 and 10.1.5.26/29, and the 10.1.5.24/29
subnet is locally attached on both nodes.
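For anyone reproducing this, the steps after editing
/etc/systemd/resolved.conf are the ordinary ones; I list them only for
completeness:

  sudo systemctl restart systemd-resolved
  resolvectl status        # confirm which DNS server systemd-resolved now uses
  sudo pcs cluster start   # cluster launch now succeeds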
The last discovery to mention is that I monitored local name resolution
while corosync was starting ("sudo resolvectl monitor"). The monitor
immediately displayed PTR queries for ALL local IP addresses of the node.
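The general symptom is easy to see by hand: with the absent VIP configured
as the only DNS server, anything that has to go upstream just waits until
it times out. For instance (example.com is an arbitrary name, not
something from my setup):

  resolvectl query example.com   # sits there until the query times out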
Based on the above, my conclusion is that something goes wrong when name
resolution points at the not-yet-existing VIP address. In my first message
I mentioned that I was only able to recover corosync by REINSTALLING it
from the repo. In order to reinstall, I was manually setting a default
route and a name server address (8.8.8.8) so that "apt reinstall corosync"
could actually work. Hence, I was unintentionally configuring a different,
reachable DNS server for systemd-resolved. So it was NOT about
reinstalling corosync but about letting systemd-resolved use some
non-local name server address.
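To make that workaround concrete, it amounted to something like the
following, where the gateway address, the interface name and the exact way
the DNS server was set are illustrative, not my real values:

  sudo ip route add default via 192.0.2.1   # temporary default route
  sudo resolvectl dns eth0 8.8.8.8          # temporary DNS server for that link
  sudo apt reinstall corosync
  sudo pcs cluster start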
I have been using corosync/pacemaker in production for a couple of years,
probably since Ubuntu Server 21.10, and I never encountered such a problem
until now. As a workaround I wrote an ansible playbook to toggle
systemd-resolved's DNS directive, but I think this glitch SHOULD NOT
exist.
I would be glad to receive comments on the above.
Regards,
On 8/20/24 21:55, Ken Gaillot wrote:
> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>> [Resending the below due to message format problem]
>>
>> Dear List,
>>
>> I have been running two different 3-node clusters for some time. I am
>> having a fatal problem with corosync: After a node failure, rebooted
>> node does NOT start corosync.
>>
>> Clusters;
>>
>> * All nodes are running Ubuntu Server 24.04
>> * corosync is 3.1.7
>> * corosync-qdevice is 3.0.3
>> * pacemaker is 2.1.6
>> * The third node at both clusters is a quorum device. Cluster is on
>>   ffsplit algorithm.
>> * All nodes are baremetal & attached to a dedicated kronosnet network.
>> * STONITH is enabled in one of the clusters and disabled for the other.
>>
>> corosync & pacemaker service starts (systemd) are disabled. I am
>> starting any cluster with the command pcs cluster start.
>>
>> corosync NEVER starts AFTER a node failure (node is rebooted). There
>
> Do you mean that the first time you run "pcs cluster start" after a
> node reboot, corosync does not come up completely?
>
> Try adding "debug: on" to the logging section of
> /etc/corosync/corosync.conf
>
>> is nothing in /var/log/corosync/corosync.log, service freezes as:
>>
>> Aug 01 12:54:56 [3193] charon corosync notice [MAIN ] Corosync Cluster
>> Engine 3.1.7 starting up
>> Aug 01 12:54:56 [3193] charon corosync info [MAIN ] Corosync
>> built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim
>> nozzle snmp pie relro bindnow
>>
>> corosync never starts kronosnet. I checked kronosnet interfaces, all OK,
>> there is IP connectivity in between. If I do corosync -t, it is the same
>> freeze.
>>
>> I could ONLY manage to start corosync by reinstalling it: apt reinstall
>> corosync ; pcs cluster start.
>>
>> The above issue repeated itself at least 5-6 times. I do NOT see
>> anything in syslog either. I will be glad if you lead me on how to solve
>> this.
>>
>> Thanks,
>>
>> Murat