Looking at

    node {
        ring0_addr: 10.255.153.159
        ring1_addr: 10.255.153.160
        name: lustre-oss-node31
        nodeid: 4
    }

I wonder how the packets are routed: What is the netmask?
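A quick check on one of the nodes would show this (a sketch only; it assumes the ring addresses are assigned to the service1/service2 interfaces mentioned below):

    # Show the addresses and prefix lengths (netmasks) on both ring interfaces
    ip -o addr show dev service1
    ip -o addr show dev service2

    # Ask the kernel which route it would pick for a peer's ring addresses
    ip route get 10.255.153.159
    ip route get 10.255.153.160

If both ring addresses sit in the same subnet, traffic for ring1 can end up routed out via service1, which defeats the redundancy.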
Kind regards,
Ulrich Windl

From: Users <users-boun...@clusterlabs.org> On Behalf Of chenzu...@gmail.com
Sent: Friday, March 14, 2025 10:48 AM
To: users <users@clusterlabs.org>
Subject: [EXT] [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration

Background:
There are 11 physical machines, each running two virtual machines: lustre-mds-nodexx runs the Lustre MDS service, and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly connected to two network interfaces, service1 and service2. Pacemaker is used to ensure high availability of the Lustre services.

Versions: lustre 2.15.5 + corosync 3.1.5 + pacemaker 2.1.0-8.el8 + pcs 0.10.8

Issue:
During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down once per second (to simulate a network failure). The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes whose heartbeats were lost. Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

Other:
The configuration of corosync.conf can be found in the attached file corosync.conf. Other relevant information is available in the attached file log.txt. The script used for the up/down testing is attached as ip_up_and_down.sh.

chenzu...@gmail.com
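For reference, since the attached corosync.conf is not quoted in the message, a minimal two-link knet setup matching the description would look roughly like the sketch below (cluster_name and link_mode are assumptions, not the actual attachment; the addresses are from the node entry above):

    totem {
        version: 2
        cluster_name: lustre
        transport: knet
        # passive: only the lowest-numbered healthy link carries traffic;
        # knet fails over to link 1 after link 0 is declared down
        link_mode: passive
    }

    nodelist {
        node {
            ring0_addr: 10.255.153.159
            ring1_addr: 10.255.153.160
            name: lustre-oss-node31
            nodeid: 4
        }
        # ... the remaining nodes, each with ring0_addr on service1
        # and ring1_addr on service2
    }

Note that knet link failure detection is driven by its ping/pong timers (knet_ping_interval, knet_ping_timeout, knet_pong_count), so an interface flapping once per second can keep a link oscillating between up and down faster than failover settles.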
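The attached ip_up_and_down.sh is likewise not quoted; a flapping loop of the kind described (interface toggled once per second) would presumably be along these lines:

    #!/bin/bash
    # Toggle the given interface once per second to simulate a flaky link.
    # A sketch only: the real ip_up_and_down.sh may differ.
    # Usage: ./ip_up_and_down.sh service1
    dev="${1:?usage: $0 <interface>}"
    while true; do
        ip link set dev "$dev" down
        sleep 1
        ip link set dev "$dev" up
        sleep 1
    done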