Background:
There are 4 physical machines, each running two virtual machines: lustre-mds-nodexx runs the Lustre MDS service and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly connected to two network interfaces, service1 (ens6f0np0) and service2 (ens6f1np1). Pacemaker is used to provide high availability for the Lustre services.

Software versions:
  Lustre:    2.15.5
  Corosync:  3.1.5
  Pacemaker: 2.1.0-8.el8
  PCS:       0.10.8
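For reference, the per-link knet status and the overall cluster view can be checked on each node with the standard tools (link 0 / link 1 correspond to the two ring addresses in corosync.conf below); exact output omitted here:

    # per-node knet link status, one entry per configured link/ring
    corosync-cfgtool -s
    # quorum/membership view
    corosync-quorumtool -s
    # Pacemaker's view of nodes, resources and fencing devices
    pcs status --full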
Operation:
During testing, the network interfaces service1 and service2 on lustre-oss-node40 and lustre-mds-node40 were repeatedly brought down and back up (down for 20 seconds, up for 30 seconds) to simulate a network failure:

    # toggle both service links to simulate a network failure
    for i in {1..10}; do
        date
        ifconfig ens6f0np0 down && ifconfig ens6f1np1 down
        sleep 20
        date
        ifconfig ens6f0np0 up && ifconfig ens6f1np1 up
        date
        sleep 30
    done

Issue:
In theory, lustre-oss-node40 and lustre-mds-node40 should have been fenced, but lustre-mds-node32 was fenced instead (the fencing history can be cross-checked as sketched after the corosync.conf below).

Related logs:

Jun 09 17:54:51 node32 fence_virtd[2502]: Destroying domain 60e80c07-107e-4e8a-ba42-39e48b3e6bb7   // This log indicates that lustre-mds-node32 was fenced.
* turning off of lustre-mds-node32 successful: delegate=lustre-mds-node42, client=pacemaker-controld.8918, origin=lustre-mds-node42, completed='2025-06-09 17:54:54.527116 +08:00'

Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] link: Resetting MTU for link 0 because host 1 joined
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun 09 17:54:10 [1429] lustre-mds-node32 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:31 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1429] lustre-mds-node32 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [1429] lustre-mds-node32 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:57:44 [1419] lustre-mds-node32 corosync notice [MAIN ] Corosync Cluster Engine 3.1.8 starting up

Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 4 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 3 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 2 link: 0 is down
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:31 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 4 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 3 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] link: host: 2 link: 1 is down
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 4 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 4 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 3 has no active links
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:34 [1412] lustre-mds-node40 corosync warning [KNET ] host: host: 2 has no active links
Jun 09 17:54:37 [1412] lustre-mds-node40 corosync notice [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] link: Resetting MTU for link 1 because host 3 joined
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 3 (passive) best link: 1 (pri: 1)
Jun 09 17:54:46 [1412] lustre-mds-node40 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] link: Resetting MTU for link 1 because host 2 joined
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [1412] lustre-mds-node40 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4

Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] rx: host: 1 link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] link: host: 1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET ] host: host: 1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] rx: host: 1 link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info [KNET ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice [TOTEM ] Failed to receive the leave message. failed: 4

/etc/corosync/corosync.conf:

totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 10000
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }
    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }
    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }
    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
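A side note on the timeouts seen in the logs (my own back-of-the-envelope check against the corosync.conf(5) defaults, so please correct me if the formula is off): with token: 10000, 4 nodes, and the default token_coefficient of 650 ms, the numbers appear to line up as

    runtime token timeout = token + (nodes - 2) * token_coefficient
                          = 10000 + (4 - 2) * 650  = 11300 ms   ("token timed out (11300ms)")
    consensus (default)   = 1.2 * runtime token
                          = 1.2 * 11300            = 13560 ms   ("waiting 13560ms for consensus")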
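For cross-checking which node was actually targeted by fencing, the fencer's own history can be queried; a minimal sketch with the standard tools (the pcs subcommand may vary slightly between pcs 0.10.x releases):

    # Pacemaker fencer history for all nodes
    stonith_admin --history '*' --verbose
    # equivalent via pcs, if supported by this version
    pcs stonith history show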
chenzu...@gmail.com