The server-side IP addresses are adjacent and all belong to the same subnet:

Node                 service1          service2
lustre-mds-node32    10.255.153.236    10.255.153.237
lustre-oss-node32    10.255.153.238    10.255.153.239
lustre-mds-node40    10.255.153.240    10.255.153.241
lustre-oss-node40    10.255.153.242    10.255.153.243
lustre-mds-node41    10.255.153.244    10.255.153.245
lustre-oss-node41    10.255.153.246    10.255.153.247

Root Cause

Messages sent to service2 do not receive a reply from the correct interface. The replies are sent from service1 instead of service2, which leads to communication failures.

Solution

Configure policy-based routing on the server side, in the same way as for the ARP flux issue on MR (multi-rail) nodes described in the LNet Router Config Guide: https://wiki.lustre.org/LNet_Router_Config_Guide
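As a rough illustration of that solution, the sketch below prints the kind of commands policy-based routing usually involves: one routing table per interface, plus a rule that selects the table by source address, so that a reply leaves through the interface that owns the address. Several things here are assumptions and not taken from the original setup: the /24 prefix length, interface device names equal to the service1/service2 labels, the table IDs 101 and 102, and the two ARP sysctl values, which are ones commonly used against ARP flux. Check them against the guide and the real network before use.

```python
from ipaddress import ip_address, ip_network

# Assumption: the prefix length is not stated in the post; /24 is a guess.
SUBNET = ip_network("10.255.153.0/24")


def pbr_commands(interfaces, first_table=101):
    """Return shell commands for source-based routing on one node.

    interfaces: list of (device_name, ip_address) pairs.
    first_table: hypothetical ID of the first free routing table.
    """
    cmds = [
        # Answer ARP only on the interface that owns the target address.
        "sysctl -w net.ipv4.conf.all.arp_ignore=1",
        # Use the best matching local address as the ARP source.
        "sysctl -w net.ipv4.conf.all.arp_announce=2",
    ]
    for offset, (dev, addr) in enumerate(interfaces):
        if ip_address(addr) not in SUBNET:
            raise ValueError(f"{addr} is not in {SUBNET}")
        table = first_table + offset
        # Per-interface table: reach the subnet through this device only.
        cmds.append(f"ip route add {SUBNET} dev {dev} src {addr} table {table}")
        # Traffic sourced from this address must use that table.
        cmds.append(f"ip rule add from {addr} table {table}")
    return cmds


if __name__ == "__main__":
    # Addresses of lustre-oss-node40 from the list above.
    node = [("service1", "10.255.153.242"), ("service2", "10.255.153.243")]
    for cmd in pbr_commands(node):
        print(cmd)
```

The script only prints the commands and does not change any system state, so the output can be reviewed before anything is applied on a server.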
chenzu...@gmail.com

From: users-request
Date: 2025-03-14 17:48
To: users
Subject: Users Digest, Vol 122, Issue 3

----------------------------------------------------------------------

Message: 1
Date: Fri, 14 Mar 2025 17:48:22 +0800
From: "chenzu...@gmail.com" <chenzu...@gmail.com>
To: users <users@clusterlabs.org>
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
Message-ID: <2025031417480017156...@gmail.com>
Content-Type: text/plain; charset="gb2312"

Background:
There are 11 physical machines, with two virtual machines running on each physical machine. lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly connected to two network interfaces, service1 and service2. Pacemaker is used to ensure high availability of the Lustre services.

lustre(2.15.5) + corosync(3.1.5) + pacemaker(2.1.0-8.el8) + pcs(0.10.8)

Issue:
During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure). The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats. Given that Corosync is configured with redundant networks, why did the heartbeat loss occur?
Is it due to a configuration issue, is Corosync not designed to handle this scenario, or is there another cause?

The Corosync configuration can be found in the attached file corosync.conf. Other relevant information is available in the attached file log.txt. The script used for the up/down testing is attached as ip_up_and_down.sh.

chenzu...@gmail.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.txt
Type: application/octet-stream
Size: 25107 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_up_and_down.sh
Type: application/octet-stream
Size: 209 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.conf
Type: application/octet-stream
Size: 1863 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0002.obj>

------------------------------

Subject: Digest Footer

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

------------------------------

End of Users Digest, Vol 122, Issue 3
*************************************