Hi! I can't actually answer your questions, but every timeout is a compromise: if it is too long, the service outage lasts longer than necessary; if it is too short, you trigger recovery mechanisms without real need. With a sub-second reaction time you may run into additional trouble. Just one example: in a SAN RAID storage system, a single disk with bad sectors caused significant read delays while the disk retried reading and the controller did not mark the disk as bad. (In this case the timeout was too long, which spared the vendor from having to replace the disk.) In such a case, switching to another node that accesses the same disks would not help…
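To make the trade-off concrete, here is a toy sketch in Python. It is not how Corosync detects failures; the function name `evaluate_timeout` and all the numbers are made up for illustration. It counts how often a healthy node would be wrongly declared dead for a given timeout, and reports the minimum time a real failure would go unnoticed (which in this simple model is the timeout itself).

```python
def evaluate_timeout(gaps_ms, timeout_ms):
    """Return (false_alarms, detection_delay_ms) for one timeout value.

    gaps_ms: observed gaps between heartbeats of a node that is actually alive.
    A gap longer than the timeout would trigger an unnecessary recovery.
    In this toy model a real failure is noticed after exactly timeout_ms.
    """
    false_alarms = sum(1 for gap in gaps_ms if gap > timeout_ms)
    return false_alarms, timeout_ms


# Hypothetical heartbeat gaps from a healthy node, in milliseconds.
# The 400 ms and 900 ms entries stand for transient stalls (assumed values).
gaps_ms = [100, 110, 95, 400, 105, 900, 100]

for timeout_ms in (300, 500, 1000, 3000):
    false_alarms, delay_ms = evaluate_timeout(gaps_ms, timeout_ms)
    print(f"timeout={timeout_ms} ms: false alarms={false_alarms}, "
          f"detection delay={delay_ms} ms")
```

With these made-up gaps, the sub-second timeouts produce false alarms, while the longer ones avoid them at the cost of a slower reaction to a real failure.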
Kind regards,
Ulrich Windl

From: Users <[email protected]> On Behalf Of Holger Haidinger <DE ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: [email protected]
Cc: Holger Haidinger <DE ERL SWD EM> <[email protected]>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Hi everyone,

I'm revisiting a thread from 2015 (https://www.mail-archive.com/[email protected]/msg00554.html) about achieving sub-second failover detection in HA clusters, and I'm curious about the current state of affairs nearly a decade later.

My Environment:
- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)

Specific Questions:
1. With modern Corosync/Pacemaker versions, is sub-second fault detection and failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the fastest stable failover times you've achieved? What's considered a reliable minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync and Pacemaker (post-2015) that specifically target detection speed and failover latency?
4. If sub-second detection is possible, what are the key configuration parameters and potential trade-offs (false positives, network sensitivity, resource overhead)?

Thanks in advance!
Holger Haidinger
