Re: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration

chenzu...@gmail.com Tue, 27 May 2025 00:25:01 -0700


On the server side, the two interfaces of every node have IP addresses in the 
same subnet:
lustre-mds-node32
    service1: 10.255.153.236
    service2: 10.255.153.237
lustre-oss-node32
    service1: 10.255.153.238
    service2: 10.255.153.239
lustre-mds-node40
    service1: 10.255.153.240
    service2: 10.255.153.241
lustre-oss-node40
    service1: 10.255.153.242
    service2: 10.255.153.243
lustre-mds-node41
    service1: 10.255.153.244
    service2: 10.255.153.245
lustre-oss-node41
    service1: 10.255.153.246
    service2: 10.255.153.247

Root Cause
Messages sent to service2 were not answered from the correct interface. With 
both interfaces in the same subnet, the replies were sent from service1 
instead of service2, so the sender never received a reply from the address it 
had contacted and communication over service2 failed.

Solution
Configure policy-based routing on the server side so that each reply leaves 
through the interface that owns its source address. This is similar to the 
fix for the ARP flux issue on MR (Multi-Rail) nodes described in the LNet 
Router Config Guide: https://wiki.lustre.org/LNet_Router_Config_Guide
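
To make the idea concrete, here is a minimal sketch of the kind of commands 
involved, written as a small Python helper that only builds the command 
strings. It is not the exact configuration used on these nodes. The /24 
prefix length, the routing table numbers 101 and 102, and the assumption that 
the network devices are literally named service1 and service2 are all 
illustrative guesses; the helper name policy_routing_commands is hypothetical.

```python
from ipaddress import ip_interface


def policy_routing_commands(iface, cidr, table):
    """Build commands so that traffic sourced from iface's address uses iface.

    iface: device name (assumed to match the interface name, e.g. "service2")
    cidr:  address with prefix length (the prefix length is an assumption)
    table: routing table number reserved for this interface (hypothetical)
    """
    addr = ip_interface(cidr)
    return [
        # Reply to ARP only on the interface that owns the requested address.
        f"sysctl -w net.ipv4.conf.{iface}.arp_ignore=1",
        # Prefer the outgoing interface's own address as the ARP source.
        f"sysctl -w net.ipv4.conf.{iface}.arp_announce=2",
        # Per-interface table: reach the subnet through this device only.
        f"ip route add {addr.network} dev {iface} src {addr.ip} table {table}",
        # Packets sourced from this address are looked up in that table.
        f"ip rule add from {addr.ip} table {table}",
    ]


if __name__ == "__main__":
    # Addresses of lustre-mds-node32 from the list above; /24 is assumed.
    for iface, cidr, table in [
        ("service1", "10.255.153.236/24", 101),
        ("service2", "10.255.153.237/24", 102),
    ]:
        for cmd in policy_routing_commands(iface, cidr, table):
            print(cmd)
```

The printed commands would have to be run as root on each node and made 
persistent through the distribution's network configuration; the guide linked 
above remains the reference for the full set of recommended settings.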

chenzu...@gmail.com

From: users-request
Date: 2025-03-14 17:48
To: users
Subject: Users Digest, Vol 122, Issue 3
----------------------------------------------------------------------

Message: 1
Date: Fri, 14 Mar 2025 17:48:22 +0800
From: "chenzu...@gmail.com" <chenzu...@gmail.com>
To: users <users@clusterlabs.org>
Subject: [ClusterLabs] Investigation of Corosync Heartbeat Loss:
Simulating Network Failures with Redundant Network Configuration
Message-ID: <2025031417480017156...@gmail.com>
Content-Type: text/plain; charset="gb2312"

Background: 
There are 11 physical machines, each running two virtual machines.
lustre-mds-nodexx runs the Lustre MDS service, and lustre-oss-nodexx runs the 
Lustre OSS service.
Each virtual machine is directly connected to two network interfaces, service1 
and service2.
Pacemaker is used to ensure high availability of the Lustre services.
Versions: lustre 2.15.5, corosync 3.1.5, pacemaker 2.1.0-8.el8, pcs 0.10.8

Issue: During testing, the network interface service1 on lustre-oss-node30 and 
lustre-oss-node40 was brought down and up repeatedly, once per second, to 
simulate a network failure.
The Corosync logs showed lost heartbeats, which triggered a fencing action 
that powered off the nodes whose heartbeats were lost.
Since Corosync is configured with redundant networks, why were heartbeats 
lost? Is this a configuration issue, or is Corosync not designed to handle 
this scenario?

Other information:
The Corosync configuration is in the attached file corosync.conf.
Other relevant information is in the attached file log.txt.
The script used for the up/down testing is attached as ip_up_and_down.sh.
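
For readers without access to the attachment, the following is a minimal 
sketch of an up/down test of this kind. It is not the attached 
ip_up_and_down.sh; the use of "ip link set", the function name 
flap_interface, and its parameters are assumptions made for illustration.

```python
import subprocess
import time


def flap_interface(iface, cycles, interval=1.0,
                   run=subprocess.run, sleep=time.sleep):
    """Bring iface down and up `cycles` times, pausing after each change.

    run and sleep are injectable so the loop can be dry-run without root.
    """
    for _ in range(cycles):
        for state in ("down", "up"):
            # Assumed mechanism: toggle the link state with iproute2.
            run(["ip", "link", "set", "dev", iface, state], check=True)
            sleep(interval)


# Example (requires root; the device name is assumed to be "service1"):
# flap_interface("service1", cycles=60, interval=1.0)
```

Passing a stub for run, such as a function that only records the command, 
lets the sequence be checked on a machine where the interface does not exist.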

chenzu...@gmail.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.txt
Type: application/octet-stream
Size: 25107 bytes
Desc: not available
URL: 
<https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ip_up_and_down.sh
Type: application/octet-stream
Size: 209 bytes
Desc: not available
URL: 
<https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.conf
Type: application/octet-stream
Size: 1863 bytes
Desc: not available
URL: 
<https://lists.clusterlabs.org/pipermail/users/attachments/20250314/e36f13fe/attachment-0002.obj>

------------------------------

Subject: Digest Footer

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

------------------------------

End of Users Digest, Vol 122, Issue 3
*************************************
