Re: [ClusterLabs] Sub-second failover detection in Corosync/Pacemaker clusters - 2026 update?

Windl, Ulrich via Users Tue, 03 Mar 2026 23:41:39 -0800

Hi!

I cannot actually answer your questions, but every timeout is a compromise:
If it is too long, you have an unnecessarily long service outage; if it is too
short, you trigger recovery mechanisms without any actual need. With a
sub-second reaction time, you may run into additional trouble.
Just one example: In a SAN RAID storage system, a single disk with bad sectors 
caused significant read delays while the disk retried reading and the 
controller did not mark the disk as bad. (In this case the timeout was too 
long, which saved the vendor from having to replace the disk.) In such a case, 
switching to another node accessing the same disks would not help…
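
To make the trade-off concrete, here is a minimal sketch in Python. The delay values are invented for illustration and are not measurements from any real cluster, and `evaluate_timeout` is a hypothetical helper, not part of Corosync or Pacemaker. For a given timeout it counts how many transient delays would have been mistaken for a node failure, next to the minimum time a real failure would go undetected.

```python
def evaluate_timeout(timeout_ms, delays_ms):
    """Return (spurious_recoveries, min_detection_ms) for one candidate timeout.

    spurious_recoveries: transient delays longer than the timeout, each of
    which would trigger recovery although the node was still alive.
    min_detection_ms: a real failure cannot be declared before the timeout
    expires, so the timeout itself is a lower bound on the outage.
    """
    spurious = sum(1 for d in delays_ms if d > timeout_ms)
    return spurious, timeout_ms


# Hypothetical response delays in milliseconds (assumption, not real data):
# mostly fast, with a few stalls such as a disk retrying a bad sector.
delays = [2, 3, 2, 5, 40, 3, 250, 4, 2, 900, 3, 6]

for timeout in (100, 500, 1000, 3000):
    spurious, detection = evaluate_timeout(timeout, delays)
    print(f"timeout={timeout} ms: spurious recoveries={spurious}, "
          f"detection takes at least {detection} ms")
```

With these made-up numbers, a 100 ms timeout reacts to two stalls that were not failures, while a 1000 ms timeout reacts to none but delays the detection of a real failure accordingly. The sketch ignores the point made above: if the stall comes from shared storage, the other node would see it as well.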

Kind regards
Ulrich Windl

From: Users <[email protected]> On Behalf Of Holger Haidinger <DE 
ERL SWD EM> via Users
Sent: Friday, February 20, 2026 4:41 PM
To: [email protected]
Cc: Holger Haidinger <DE ERL SWD EM> <[email protected]>
Subject: [EXT] [EXT] [ClusterLabs] Sub-second failover detection in 
Corosync/Pacemaker clusters - 2026 update?

Hi everyone,

I'm revisiting a thread from 2015 
(https://www.mail-archive.com/[email protected]/msg00554.html) about 
achieving sub-second failover detection in HA clusters, and I'm curious about 
the current state of affairs more than a decade later.

My Environment:

- Corosync 3.1.6
- Pacemaker 2.1.2
- Architecture: 2-node cluster + QDevice (also testing 3-node setups)
- Network: Dedicated physical NIC for cluster traffic (low-latency requirements)

Specific Questions:

1. With modern Corosync/Pacemaker versions, is sub-second fault detection and 
failover initiation realistically achievable in production environments?
2. Are there any published measurements or community experiences showing the 
fastest stable failover times you've achieved? What's considered a reliable 
minimum time span?
3. Have there been significant enhancements in the newer versions of Corosync 
and Pacemaker (post-2015) that specifically target detection speed and failover 
latency?
4. If sub-second detection is possible, what are the key configuration 
parameters and potential trade-offs (false positives, network sensitivity, 
resource overhead)?

Thanks in advance!

Holger Haidinger

