Hello Ken,
I think I have resolved the problem on my own.
Yes, right after boot, corosync fails to come up. The problem appears to
be related to name resolution. I ran corosync in the foreground under
strace: corosync froze, and the strace output was suspicious, full of
name-resolution-like calls.
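For reference, this is roughly how I traced it; the exact strace options
below are illustrative, not necessarily the ones I typed:

  sudo strace -f -o /tmp/corosync.trace corosync -f
  grep -E 'connect|sendto|recvmsg' /tmp/corosync.trace

corosync -f keeps the daemon in the foreground, and the trace file is
where the resolver traffic shows up.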
In my failing cluster, I am running a containerized BIND9 for regular
name resolution services. Both nodes run systemd-resolved for their own
local name resolution. The relevant directives of resolved.conf are:
DNS=10.1.5.30
#DNS=1.2.3.4
#FallbackDNS=
10.1.5.30/29 is the virtual IP address on the nodes at which BIND9 can be
queried. Both the VIP and the BIND9 container are managed by pacemaker, so
right after a reboot the node does NOT hold the VIP and NO container is
running.
When I changed the directives to:
#DNS=10.1.5.30
DNS=1.2.3.4
#FallbackDNS=
corosync runs perfectly and a successful cluster launch follows. 1.2.3.4
is a bogus address, and the node does NOT have a default route before the
cluster launches, so it obviously does NOT receive any replies to its name
queries while corosync is coming up. However, after a reboot both nodes do
have valid addresses, 10.1.5.25/29 and 10.1.5.26/29, and the 10.1.5.24/29
subnet is locally attached on both nodes.
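For anyone reproducing this, the steps after editing
/etc/systemd/resolved.conf are the ordinary ones; I list them only for
completeness:

  sudo systemctl restart systemd-resolved
  resolvectl status        # confirm which DNS server systemd-resolved now uses
  sudo pcs cluster start   # cluster launch now succeeds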
The last discovery to mention is that I monitored local name resolution
while corosync was starting ("sudo resolvectl monitor"). The monitor
immediately displayed PTR queries for ALL local IP addresses of the node.
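The general symptom is easy to see by hand: with the absent VIP configured
as the only DNS server, anything that has to go upstream just waits until
it times out. For instance (example.com is an arbitrary name, not
something from my setup):

  resolvectl query example.com   # sits there until the query times out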
Based on the above, my conclusion is that something goes wrong when name
resolution points at the not-yet-existing VIP address. In my first message
I mentioned that I was only able to recover corosync by REINSTALLING it
from the repo. In order to reinstall, I was manually setting a default
route and a name server address (8.8.8.8) so that "apt reinstall corosync"
could actually work. Hence, I was unintentionally configuring a different,
reachable DNS server for systemd-resolved. So it was NOT about
reinstalling corosync but about letting systemd-resolved use some
non-local name server address.
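To make that workaround concrete, it amounted to something like the
following, where the gateway address, the interface name and the exact way
the DNS server was set are illustrative, not my real values:

  sudo ip route add default via 192.0.2.1   # temporary default route
  sudo resolvectl dns eth0 8.8.8.8          # temporary DNS server for that link
  sudo apt reinstall corosync
  sudo pcs cluster start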
I have been using corosync/pacemaker in production for a couple of years,
probably since Ubuntu Server 21.10, and I never encountered such a problem
until now. As a workaround I wrote an ansible playbook to toggle
systemd-resolved's DNS directive, but I think this glitch SHOULD NOT
exist.
I would be glad to receive comments on the above.
Regards,
On 8/20/24 21:55, Ken Gaillot wrote:
> On Mon, 2024-08-19 at 12:58 +0300, Murat Inal wrote:
>> [Resending the below due to message format problem]
>>
>> Dear List,
>>
>> I have been running two different 3-node clusters for some time. I am
>> having a fatal problem with corosync: After a node failure, rebooted
>> node does NOT start corosync.
>>
>> Clusters;
>>
>> * All nodes are running Ubuntu Server 24.04
>> * corosync is 3.1.7
>> * corosync-qdevice is 3.0.3
>> * pacemaker is 2.1.6
>> * The third node at both clusters is a quorum device. Cluster is on
>>   ffsplit algorithm.
>> * All nodes are baremetal & attached to a dedicated kronosnet network.
>> * STONITH is enabled in one of the clusters and disabled for the other.
>>
>> corosync & pacemaker service starts (systemd) are disabled. I am
>> starting any cluster with the command pcs cluster start.
>>
>> corosync NEVER starts AFTER a node failure (node is rebooted). There
>
> Do you mean that the first time you run "pcs cluster start" after a
> node reboot, corosync does not come up completely?
>
> Try adding "debug: on" to the logging section of
> /etc/corosync/corosync.conf
>
>> is nothing in /var/log/corosync/corosync.log, service freezes as:
>>
>> Aug 01 12:54:56 [3193] charon corosync notice [MAIN ] Corosync Cluster
>> Engine 3.1.7 starting up
>> Aug 01 12:54:56 [3193] charon corosync info [MAIN ] Corosync
>> built-in features: dbus monitoring watchdog augeas systemd xmlconf vqsim
>> nozzle snmp pie relro bindnow
>>
>> corosync never starts kronosnet. I checked kronosnet interfaces, all OK,
>> there is IP connectivity in between. If I do corosync -t, it is the same
>> freeze.
>>
>> I could ONLY manage to start corosync by reinstalling it: apt reinstall
>> corosync ; pcs cluster start.
>>
>> The above issue repeated itself at least 5-6 times. I do NOT see
>> anything in syslog either. I will be glad if you lead me on how to solve
>> this.
>>
>> Thanks,
>>
>> Murat