Re: [ClusterLabs] poor performance for large resource configuration
Hi Miroslav,

Thank you for your helpful suggestions! I followed your advice and used the pcs command with the -f option to create the cluster CIB configuration, and then pushed it with pcs cluster cib-push. I'm happy to report that the configuration time has been significantly reduced: resource configuration now takes 31 minutes instead of the original 2 hours and 31 minutes.

I appreciate your help in pointing me in the right direction. Thanks again for your assistance!

Best regards,
Zufei Chen

attachment: new create bash: pcs_create.sh

---

Message: 1
Date: Thu, 24 Oct 2024 14:01:57 +0200
From: Miroslav Lisik
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] poor performance for large resource configuration

On 10/21/24 13:07, zufei chen wrote:
> Hi all,
>
> Background:
>
> 1. lustre (2.15.5) + corosync (3.1.5) + pacemaker (2.1.0-8.el8) + pcs (0.10.8)
> 2. There are 11 nodes in total, divided into 3 groups. If a node fails within a group, its resources can only be taken over by nodes within that group.
> 3. Each node has 2 MDTs and 16 OSTs.
>
> Issues:
>
> 1. The resource configuration time progressively increases: the second resource (mdt-0) cost only about 8 s, while the last one (ost-175) cost about 1 min 37 s.
> 2. The total time taken for the configuration is approximately 2 hours and 31 minutes. Is there a way to improve it?
>
> attachments:
> create bash: pcs_create.sh
> create log: pcs_create.log

Hi,

you could try to create the cluster CIB configuration with pcs commands on a file using the '-f' option and then push it to pacemaker all at once:

pcs cluster cib > original.xml
cp original.xml new.xml
pcs -f new.xml ...
...
pcs cluster cib-push new.xml diff-against=original.xml

And then wait for the cluster to settle into a stable state:

crm_resource --wait

Or, since pcs version 0.11.8, there is:

pcs status wait []

I hope this will help you to improve the performance.

Regards,
Miroslav
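For readers without the attached pcs_create.sh, here is a minimal sketch of how such a batched creation could be built against the offline CIB file. The resource names, device paths, fstype, and operation settings below are illustrative assumptions, not taken from the actual script:

```
#!/bin/bash
# Sketch only: create many Lustre Filesystem resources against an offline CIB
# copy, then push the whole configuration to Pacemaker in one operation.
# All resource names and device paths are placeholders.
set -euo pipefail

pcs cluster cib > original.xml
cp original.xml new.xml

# Create the OST resources against the file (-f) instead of the live CIB.
for i in $(seq 0 175); do
    pcs -f new.xml resource create "ost-${i}" ocf:heartbeat:Filesystem \
        device="/dev/disk/by-id/virtio-ost-${i}" \
        directory="/lustre/ost-${i}" \
        fstype=lustre \
        op monitor interval=60s
done

# Push all accumulated changes at once and wait for the cluster to settle.
pcs cluster cib-push new.xml diff-against=original.xml
crm_resource --wait
```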
Re: [ClusterLabs] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
Thank you for your advice. My understanding of the cause is as follows: during reboot, both the system (systemd) and Pacemaker try to unmount the Lustre resource at the same time. If systemd starts the unmount first and Pacemaker's stop runs afterward, Pacemaker immediately returns success even though the system's unmount has not yet completed. Pacemaker then mounts the resource on the target node while the source is still unmounting, which triggers this issue.

My current modification is to add the following lines to the file `/usr/lib/systemd/system/resource-agents-deps.target`:

```
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
```

After making this modification, the issue no longer occurs during reboot.

chenzu...@gmail.com
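As a side note, the same ordering can be applied without editing the packaged unit under /usr/lib (which a resource-agents package update could overwrite) by using a standard systemd drop-in. A minimal sketch, with an illustrative drop-in file name:

```
# Create a drop-in override for resource-agents-deps.target instead of
# editing the unit file shipped by the package. The file name is arbitrary.
mkdir -p /etc/systemd/system/resource-agents-deps.target.d
cat > /etc/systemd/system/resource-agents-deps.target.d/lustre-ordering.conf <<'EOF'
[Unit]
After=remote-fs.target
Before=shutdown.target reboot.target halt.target
EOF
systemctl daemon-reload
```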
[ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
Background:
There are 11 physical machines, with two virtual machines running on each physical machine. lustre-mds-nodexx runs the Lustre MDS server, and lustre-oss-nodexx runs the Lustre OSS service. Each virtual machine is directly connected to two network interfaces, service1 and service2. Pacemaker is used to ensure high availability of the Lustre services.
lustre (2.15.5) + corosync (3.1.5) + pacemaker (2.1.0-8.el8) + pcs (0.10.8)

Issue:
During testing, the network interface service1 on lustre-oss-node30 and lustre-oss-node40 was repeatedly brought up and down every 1 second (to simulate a network failure). The Corosync logs showed that heartbeats were lost, triggering a fencing action that powered off the nodes with lost heartbeats. Given that Corosync is configured with redundant networks, why did the heartbeat loss occur? Is it due to a configuration issue, or is Corosync not designed to handle this scenario?

Other:
The configuration of corosync.conf can be found in the attached file corosync.conf. Other relevant information is available in the attached file log.txt. The script used for the up/down testing is attached as ip_up_and_down.sh.

chenzu...@gmail.com
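The attached ip_up_and_down.sh is not reproduced in the digest; a hypothetical sketch of the kind of 1-second flap test described above might look like the following (taking the interface name as an argument is an assumption):

```
#!/bin/bash
# Sketch of a link-flap test: bring an interface down and up every second.
# Not the actual attached script; the interface name is a placeholder.
IFACE="${1:-service1}"
while true; do
    ip link set "$IFACE" down
    sleep 1
    ip link set "$IFACE" up
    sleep 1
done
```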
[ClusterLabs] Lustre MDT/OST Mount Failures During Virtual Machine Reboot with Pacemaker
1. Background:
There are three physical servers, each running a KVM virtual machine. The virtual machines host the Lustre services (MGS/MDS/OSS). Pacemaker is used to ensure high availability of the Lustre services.
lustre (2.15.5) + corosync (3.1.5) + pacemaker (2.1.0-8.el8) + pcs (0.10.8)

2. Problem:
When a reboot command is issued on one of the virtual machines, the MDT/OST resources are taken over by the virtual machines on other nodes. However, mounting these resources fails during the switchover (Pacemaker attempts the mount multiple times and eventually succeeds).
Workaround: Before executing the reboot command, run pcs node standby to move the resources away.
Question: I would like to know whether this is an inherent issue with Pacemaker.

3. Analysis:
From the log analysis, it appears that the MDT/OST resources are being mounted on the target node before the unmount process has completed on the source node. Multiple Mount Protection (MMP) detects that the source node has updated the sequence number, which causes the mount operation to fail on the target node.

4. Logs:

Node 28 (source node):

Tue Feb 18 23:46:31 CST 2025  reboot

ll /dev/disk/by-id/virtio-ost-node28-3-36
lrwxrwxrwx 1 root root 9 Feb 18 23:47 /dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdy

* ost-36_start_0 on lustre-oss-node29 'error' (1): call=769, status='complete', exitreason='Couldn't mount device [/dev/disk/by-id/virtio-ost-node28-3-36] as /lustre/ost-36', last-rc-change='Tue Feb 18 23:46:32 2025', queued=0ms, exec=21472ms

Feb 18 23:46:31 lustre-oss-node28 systemd[1]: Unmounting /lustre/ost-36...
Feb 18 23:46:31 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:186: czf MMP failure info: epoch:6609375025013, seq: 37, last update time: 1739893591, last update node: lustre-oss-node28, last update device: vdy
Feb 18 23:46:32 lustre-oss-node28 Filesystem(ost-36)[19748]: INFO: Running stop for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
Feb 18 23:46:32 lustre-oss-node28 pacemaker-controld[1700]: notice: Result of stop operation for ost-36 on lustre-oss-node28: ok
Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf set mmp seq clean
Feb 18 23:46:34 lustre-oss-node28 kernel: LDISKFS-fs warning (device vdy): kmmpd:258: czf MMP failure info: epoch:6612033802827, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy
Feb 18 23:46:34 lustre-oss-node28 systemd[1]: Unmounted /lustre/ost-36.

Node 29 (target node):

/dev/disk/by-id/virtio-ost-node28-3-36 -> ../../vdt

Feb 18 23:46:32 lustre-oss-node29 Filesystem(ost-36)[451114]: INFO: Running start for /dev/disk/by-id/virtio-ost-node28-3-36 on /lustre/ost-36
Feb 18 23:46:32 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:350: MMP interval 42 higher than expected, please wait.
Feb 18 23:46:53 lustre-oss-node29 kernel: czf, not equel, Current time: 23974372799987 ns, 37,4283256144
Feb 18 23:46:53 lustre-oss-node29 kernel: LDISKFS-fs warning (device vdt): ldiskfs_multi_mount_protect:364: czf MMP failure info: epoch:23974372801877, seq: 4283256144, last update time: 1739893594, last update node: lustre-oss-node28, last update device: vdy

chenzu...@gmail.com
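For reference, a sketch of the workaround mentioned above (draining a node before rebooting it); the node name matches the example logs, and the unstandby step afterwards is an assumption about the usual follow-up:

```
# Drain resources from the node before rebooting it.
pcs node standby lustre-oss-node28
crm_resource --wait            # wait until the resources have moved away
reboot

# Once the node is back online, allow it to host resources again.
pcs node unstandby lustre-oss-node28
```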
Re: [ClusterLabs] Investigation of Corosync Heartbeat Loss: Simulating Network Failures with Redundant Network Configuration
The server-side configuration IP addresses are similar and belong to the same subnet:

lustre-mds-node32  service1: 10.255.153.236  service2: 10.255.153.237
lustre-oss-node32  service1: 10.255.153.238  service2: 10.255.153.239
lustre-mds-node40  service1: 10.255.153.240  service2: 10.255.153.241
lustre-oss-node40  service1: 10.255.153.242  service2: 10.255.153.243
lustre-mds-node41  service1: 10.255.153.244  service2: 10.255.153.245
lustre-oss-node41  service1: 10.255.153.246  service2: 10.255.153.247

Root cause:
Messages sent to service2 fail to receive a reply from the correct interface. Specifically, replies are being sent from service1 instead of service2, which leads to communication failures.

Solution:
Configure policy-based routing on the server side, similar to the fix for the ARP flux issue on MR (multi-rail) nodes described in https://wiki.lustre.org/LNet_Router_Config_Guide.

chenzu...@gmail.com
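A minimal sketch of what such policy-based routing could look like for two interfaces on the same subnet, using the lustre-oss-node40 addresses listed above; the /24 netmask and the routing table IDs are assumptions, not taken from the original post:

```
# Give each interface its own routing table and a source-based rule, so that
# replies to traffic arriving on service2 leave via service2 (and likewise
# for service1). Table IDs 101/102 and the /24 prefix are placeholders.
ip route add 10.255.153.0/24 dev service1 src 10.255.153.242 table 101
ip rule  add from 10.255.153.242 table 101

ip route add 10.255.153.0/24 dev service2 src 10.255.153.243 table 102
ip rule  add from 10.255.153.243 table 102

ip route flush cache
```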
[ClusterLabs] Incorrect Node Fencing Issue in Lustre Cluster During Network Failure Simulation
17:54:50 [1412] lustre-mds-node40 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [1412] lustre-mds-node40 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4

Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 09 17:54:29 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info    [KNET  ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8913] lustre-mds-node41 corosync warning [KNET  ] host: host: 1 has no active links
Jun 09 17:54:36 [8913] lustre-mds-node41 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:39 [8913] lustre-mds-node41 corosync notice  [TOTEM ] A processor failed, forming new configuration: token timed out (11300ms), waiting 13560ms for consensus.
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] rx: host: 1 link: 1 is up
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:47 [8913] lustre-mds-node41 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8913] lustre-mds-node41 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4

Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info    [KNET  ] link: host: 1 link: 0 is down
Jun 09 17:54:28 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info    [KNET  ] link: host: 1 link: 1 is down
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:30 [8900] lustre-mds-node42 corosync warning [KNET  ] host: host: 1 has no active links
Jun 09 17:54:36 [8900] lustre-mds-node42 corosync notice  [TOTEM ] Token has not been received in 8475 ms
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] rx: host: 1 link: 1 is up
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] link: Resetting MTU for link 1 because host 1 joined
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun 09 17:54:45 [8900] lustre-mds-node42 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1397
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [QUORUM] Sync members[3]: 1 2 3
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [QUORUM] Sync left[1]: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [TOTEM ] A new membership (1.45) was formed. Members left: 4
Jun 09 17:54:50 [8900] lustre-mds-node42 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 4

/etc/corosync/corosync.conf:

totem {
    version: 2
    cluster_name: mds_cluster
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
    cluster_uuid: 11f2c4097ac44d5981769a9ed579c99e
    token: 1
}

nodelist {
    node {
        ring0_addr: 10.255.153.240
        ring1_addr: 10.255.153.241
        name: lustre-mds-node40
        nodeid: 1
    }
    node {
        ring0_addr: 10.255.153.244
        ring1_addr: 10.255.153.245
        name: lustre-mds-node41
        nodeid: 2
    }
    node {
        ring0_addr: 10.255.153.248
        ring1_addr: 10.255.153.249
        name: lustre-mds-node42
        nodeid: 3
    }
    node {
        ring0_addr: 10.255.153.236
        ring1_addr: 10.255.153.237
        name: lustre-mds-node32
        nodeid: 4
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}

chenzu...@gmail.com
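As an aside, a quick way to watch the state of both knet links from each node while reproducing this is the standard corosync tooling (not part of the original report):

```
# Print this node's view of each configured knet link.
corosync-cfgtool -s

# Dump the runtime stats map; per-link entries include a "connected" flag
# (exact key names can vary between corosync 3.x versions).
corosync-cmapctl -m stats | grep -i connected
```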