On 14/03/2025 10:48, chenzu...@gmail.com wrote:
> Background:
> There are 11 physical machines, with two virtual machines running on each
> physical machine.
> lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs
> the Lustre OSS service.
> Each virtual machine is directly connected to two network interfaces,
> service1 and service2.
> Pacemaker is used to ensure high availability of the Lustre services.
> lustre (2.15.5) + corosync (3.1.5) + pacemaker (2.1.0-8.el8) + pcs (0.10.8)
>
> Issue: During testing, the network interface service1 on lustre-oss-node30
> and lustre-oss-node40 was repeatedly brought up and down every 1 second
> (to simulate a network failure).
> The Corosync logs showed that heartbeats were lost, triggering a fencing
> action that powered off the nodes with lost heartbeats.
> Given that Corosync is configured with redundant networks, why did the
> heartbeat loss occur? Is it due to a configuration issue, or is Corosync
> not designed to handle this scenario?

Honestly, I don't think it is really configured with redundant networks.
Ifdown is not an ideal method of testing, but Corosync 3.x should be able
to handle it. Using iptables/nftables/firewall rules is still the
recommended approach.
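
For example, a rough sketch of simulating the failure with iptables
instead of ifdown (this assumes the default knet UDP port 5405 and that
the interface under test is named service1; adjust to your environment):

    # Drop corosync/knet traffic on service1 in both directions,
    # leaving the interface itself (addresses, routes) up
    iptables -A INPUT  -i service1 -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -o service1 -p udp --dport 5405 -j DROP

    # ... run the test ...

    # Delete the rules to restore connectivity
    iptables -D INPUT  -i service1 -p udp --dport 5405 -j DROP
    iptables -D OUTPUT -o service1 -p udp --dport 5405 -j DROP

Unlike ifdown, this keeps the local addresses and routes in place, so
knet sees a lost link rather than a vanished interface.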
> Other:
> The configuration of corosync.conf can be found in the attached file
> corosync.conf.
From the config file it looks like both rings are on the same network.
Could you please share your network configuration?
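
For comparison, a minimal sketch of a nodelist where the two rings really
are on separate networks (the addresses, subnets, and node IDs below are
purely illustrative):

    nodelist {
        node {
            # ring0 on the service1 network, ring1 on a physically
            # separate service2 network -- note the different subnets
            ring0_addr: 192.168.10.30
            ring1_addr: 192.168.20.30
            name: lustre-oss-node30
            nodeid: 30
        }
        node {
            ring0_addr: 192.168.10.40
            ring1_addr: 192.168.20.40
            name: lustre-oss-node40
            nodeid: 40
        }
    }

If both ring addresses end up on the same subnet/switch, losing that
network takes out both links at once and the redundancy buys you nothing.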
Honza
> Other relevant information is available in the attached file log.txt.
> The script used for the up/down testing is attached as ip_up_and_down.sh.
>
> chenzu...@gmail.com
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/