Re: [ClusterLabs] pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write delays

Dmytro Poliarush via Users Fri, 10 Apr 2026 02:39:10 -0700

Hi Windl,

Just a reminder about the suspicious behaviour in pacemaker 1.1.23.
Would you please find time to check the strace below, and perhaps point me to 
someone more knowledgeable?

Regards, Dmytro

________________________________
From: Dmytro Poliarush <[email protected]>
Sent: 25 March 2026 17:19
To: Windl, Ulrich <[email protected]>; Cluster Labs - All topics related to 
open-source clustering welcomed <[email protected]>
Subject: Re: pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write 
delays

Hi Windl,
Thank you very much for your prompt reply and the links on timeouts.
I've tried all of those already and they are NOT working.
From my observation there is some kind of hardcoded 20sec timeout in stonithd 
on pacemaker 1.1.23.
In this pacemaker version stonithd is compiled from: `commands.c`, `internal.h`, 
`main.c`, `remote.c`
We assume that the 20sec timeout is hardcoded somewhere in these sources.
The most logical candidate so far was:
```
    fencing/commands.c
    #define DEFAULT_QUERY_TIMEOUT 20
```

But changing that value to 120 did NOT work.
strace still shows stonithd closing the socket with stonith_admin after about 
20sec of polling.
This is visible in the attached st_admin_strace.9964.comments.log:
```
         05:45:08.680800 socket(AF_UNIX, SOCK_STREAM, 0) = 5<UNIX:[420071]> 
<0.000446>
         05:45:08.694409 connect(5<UNIX:[420071]>, {sa_family=AF_UNIX, 
sun_path=@"stonith-ng\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"},
 110) = 0 <0.000810>
         05:45:08.699454 poll([{fd=5<UNIX:[420071->420078]>, events=POLLIN}], 
1, 0) = 0 (Timeout) <0.000004>
         05:45:08.699719 poll([{fd=5<UNIX:[420071->420078]>, events=POLLIN}], 
1, 0) = 0 (Timeout) <0.000005>
         ... more polling on fd=5 here ...
         05:45:08.700324 poll([{fd=5<UNIX:[420071->420078]>, events=POLLIN}], 
1, 0) = 0 (Timeout) <0.000006>
         05:45:09.099605 poll([{fd=5<UNIX:[420071->420078]>, events=POLLIN}], 
1, 0) = 1 ([{fd=5, revents=POLLIN}]) <0.000092>
         05:45:29.344300 shutdown(5<UNIX:[420071->420078]>, SHUT_RDWR) = 0 
<0.000022>
         05:45:29.344391 close(5<UNIX:[420071->420078]>) = 0 <0.000030>
         05:45:29.346138 exit_group(-62)         = ?
         05:45:29.347107 +++ exited with 194 +++

```
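As a quick cross-check of the trace above, the arithmetic can be reproduced. This is only a sketch: the timestamps and the `-62` are copied from the strace lines, the helper names are made up for illustration, and reading 62 as `ETIME` ("Timer expired") assumes Linux errno numbering. Note also that strace records the system calls of the traced process, so the `shutdown()` and `exit_group()` lines are calls issued by stonith_admin itself. Whether that means the timeout fires inside stonith_admin rather than stonithd is an interpretation, not something the trace proves.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two strace wall-clock timestamps (HH:MM:SS.ffffff)."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def wait_status(exit_arg: int) -> int:
    """Only the low 8 bits of the exit argument are reported to the parent."""
    return exit_arg & 0xFF


# Timestamps copied from the connect() and shutdown() lines of the trace.
connect_to_shutdown = elapsed_seconds("05:45:08.694409", "05:45:29.344300")

# exit_group(-62): assuming Linux errno numbering, 62 is ETIME.
status = wait_status(-62)

print(f"connect -> shutdown: {connect_to_shutdown:.3f} s")
print(f"exit status seen by the shell: {status}")
```

The second value matches both the `exited with 194` line and the `Return Value: 194` that pcs prints.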

What is MOST important here is that the top of the strace log shows 
stonith_admin was started with `-t 60` (timeout in seconds):
    05:45:08.659305 execve("/usr/sbin/stonith_admin", ["stonith_admin", "-VVV", "-t", "60", "-B", "node2"], 0x7ffd2b75e158 /* 33 vars */) = 0 <0.000162>

But somehow stonithd ignores that, and after about 20sec of polling the socket 
is shut down:
         05:45:29.344300 shutdown(5<UNIX:[420071->420078]>, SHUT_RDWR) = 0 <0.000022>

Regards, Dmytro

________________________________
From: Windl, Ulrich <[email protected]>
Sent: 23 March 2026 12:49
To: Cluster Labs - All topics related to open-source clustering welcomed 
<[email protected]>
Cc: Dmytro Poliarush <[email protected]>
Subject: RE: pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write 
delays


I think you should provide more information, like SBD configuration, syslog 
messages, etc.

Usually node fencing via SBD works by writing a message to a shared disk slot.

Once written, SBD/pacemaker expects the node to suicide soon. However multiple 
timeouts are configurable.

Asking AI, I got this (treat with some care):

I'll search for the specific timeout parameters and their interdependencies in 
Linux Pacemaker clusters with SBD fencing.

SBD Timeout Parameters in Linux Pacemaker Clusters

In a Pacemaker cluster using SBD (STONITH Block Device) for fencing, there are 
four primary configurable timeout parameters, and they have strict 
interdependencies. Understanding these relationships is critical for reliable 
cluster operation.

________________________________

Timeout Parameters and Their Interdependencies

| Parameter | Configuration Location | Type | Purpose | Default |
|---|---|---|---|---|
| SBD_WATCHDOG_TIMEOUT | /etc/sysconfig/sbd | SBD daemon config | Hardware watchdog timeout; triggers node self-fence if no kick received | 5 seconds |
| msgwait | SBD device metadata | SBD device level | Time window for message delivery to node slot on SBD device | Set during device initialization |
| stonith-timeout | Pacemaker CIB (cluster property) | Global cluster property | Maximum time Pacemaker waits for STONITH action (reboot/off) to complete | 60 seconds |
| stonith-watchdog-timeout | Pacemaker CIB (cluster property) | Global cluster property | Time Pacemaker assumes fencing has completed via watchdog (diskless SBD only) | 0 (disabled by default) |

________________________________

Critical Interdependencies

The timeout parameters have strict mathematical relationships that must be 
maintained for proper cluster behavior:

For Disk-Based SBD (with shared storage devices):

msgwait >= (watchdog_timeout × 2)

stonith-timeout >= msgwait + 20%

Example: If watchdog timeout is 30 seconds:

  *   msgwait must be at least 60 seconds
  *   stonith-timeout must be at least 72 seconds (60 + 20%)

For Diskless SBD (watchdog-only, no shared storage):

stonith-watchdog-timeout >= (SBD_WATCHDOG_TIMEOUT × 2)

stonith-timeout >= stonith-watchdog-timeout + 20%

Example: If SBD_WATCHDOG_TIMEOUT is 5 seconds:

  *   stonith-watchdog-timeout must be at least 10 seconds
  *   stonith-timeout must be at least 12 seconds (10 + 20%)
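
Taking the two sets of inequalities above at face value (they come from the AI answer, so treat them with the same care), the minimum values can be derived mechanically. The sketch below is illustrative only: the function names are hypothetical, and the "+ 20%" is rounded up to whole seconds using integer arithmetic.

```python
def plus_twenty_percent(seconds: int) -> int:
    """Return seconds + 20%, rounded up to a whole second (integer math only)."""
    return (seconds * 6 + 4) // 5


def disk_based_minimums(watchdog_timeout: int) -> dict:
    """Minimums for disk-based SBD according to the quoted rules."""
    msgwait = 2 * watchdog_timeout
    return {"msgwait": msgwait, "stonith-timeout": plus_twenty_percent(msgwait)}


def diskless_minimums(sbd_watchdog_timeout: int) -> dict:
    """Minimums for diskless (watchdog-only) SBD according to the quoted rules."""
    stonith_watchdog_timeout = 2 * sbd_watchdog_timeout
    return {
        "stonith-watchdog-timeout": stonith_watchdog_timeout,
        "stonith-timeout": plus_twenty_percent(stonith_watchdog_timeout),
    }


disk_example = disk_based_minimums(30)   # the 30-second example above
diskless_example = diskless_minimums(5)  # the 5-second example above
print(disk_example)
print(diskless_example)
```

Both calls reproduce the figures in the examples above: 60 and 72 seconds for the disk-based case, 10 and 12 seconds for the diskless case.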

________________________________

How These Parameters Interact

Watchdog Timeout (SBD_WATCHDOG_TIMEOUT)

This is the foundation of the timeout hierarchy. It represents how long the 
hardware watchdog will wait for a "kick" (heartbeat) from the SBD daemon before 
forcibly resetting the node. If storage latency or system issues prevent the 
SBD daemon from operating, the node self-fences after this timeout expires.

Message Wait Timeout (msgwait)

This is set in the SBD device metadata during initialization and defines the 
grace period for a fencing message to be acknowledged as delivered to the 
target node's slot. It must be at least twice the watchdog timeout to ensure 
the node has time to detect the fencing message and self-fence gracefully 
before the watchdog triggers.

STONITH Timeout (stonith-timeout)

This is a Pacemaker cluster property that controls how long the cluster waits 
for the fencing action to complete. It must exceed msgwait by at least 20% to 
allow sufficient time for the message to be delivered and processed. If this 
timeout is too short, the cluster may consider the fencing action failed and 
retry, causing unnecessary delays.

STONITH Watchdog Timeout (stonith-watchdog-timeout)

This parameter is only used for diskless SBD and tells Pacemaker how long to 
wait before assuming a node has already self-fenced via the watchdog. It must 
be at least twice the SBD_WATCHDOG_TIMEOUT to provide a safety margin. Setting 
this to 0 (the default) disables resource recovery and is appropriate only for 
disk-based SBD configurations.

________________________________

Critical Warnings

Pay attention to these constraints:

  *   Do not set stonith-watchdog-timeout until SBD is configured and running 
on every node, including Pacemaker Remote nodes.
  *   If stonith-timeout < stonith-watchdog-timeout in diskless SBD, nodes can 
become stuck in an UNCLEAN state, blocking failover.
  *   For multipath or iSCSI setups, the watchdog timeout should account for 
path failure detection and failover time. The max_polling_interval in 
/etc/multipath.conf must be less than the watchdog timeout.
  *   Changing watchdog timeout requires coordinating changes across all 
dependent timeouts to maintain the mathematical relationships.
  *   Storage latency is the primary driver of watchdog timeout values; 
high-latency storage requires longer timeouts, which cascades into longer 
msgwait and stonith-timeout values.
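
One way to keep the dependent timeouts coordinated is to check a planned set of values against the quoted relationships before applying them. The following is a sketch under that assumption only: the function and parameter names are hypothetical, the values are passed in by hand rather than read from a cluster, and the rules encoded are those of the AI answer above, not verified Pacemaker or SBD behaviour.

```python
def check_sbd_timeouts(watchdog, stonith_timeout, msgwait=None,
                       stonith_watchdog_timeout=None):
    """Return a list of violated rules (an empty list means none were found).

    All values are whole seconds. Pass msgwait for disk-based SBD, or a
    non-zero stonith_watchdog_timeout for diskless SBD.
    """
    problems = []
    if msgwait is not None:
        if msgwait < 2 * watchdog:
            problems.append("msgwait < 2 x watchdog timeout")
        # stonith_timeout >= 1.2 * msgwait, written without floating point
        if 5 * stonith_timeout < 6 * msgwait:
            problems.append("stonith-timeout < msgwait + 20%")
    if stonith_watchdog_timeout:
        if stonith_watchdog_timeout < 2 * watchdog:
            problems.append("stonith-watchdog-timeout < 2 x watchdog timeout")
        if 5 * stonith_timeout < 6 * stonith_watchdog_timeout:
            problems.append("stonith-timeout < stonith-watchdog-timeout + 20%")
    return problems


# Disk-based example: a 30 s watchdog and 60 s msgwait with the 60 s
# stonith-timeout that the table above lists as the default.
print(check_sbd_timeouts(watchdog=30, stonith_timeout=60, msgwait=60))
```

With these inputs the check reports that stonith-timeout is below msgwait + 20%; raising it to 72 seconds clears the list.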

The interdependency structure ensures that each timeout layer provides 
sufficient time for the layer below it to complete, preventing race conditions 
and cluster deadlock scenarios.

Kind regards,

Ulrich Windl

From: Users <[email protected]> On Behalf Of Dmytro Poliarush via 
Users
Sent: Tuesday, March 17, 2026 12:32 PM
To: [email protected]
Cc: Dmytro Poliarush <[email protected]>
Subject: [EXT] [EXT] [ClusterLabs] pacemaker: 1.1.23 20sec timeout on cluster 
with disc I/O write delays


Hi all,

I need a little guidance on pacemaker 1.1.23.

I'm chasing a stubborn issue in a 2-node, 2-disc SBD cluster.

When running a manual fencing test with the `pcs stonith fence` command, I 
observe an error:

```

    Error: unable to fence '<nodehostname>'

```

The error manifests each time at around `20 seconds` (I assume this is a 
timeout).

The `time` command is used to track how long the run takes: `time pcs stonith 
fence`.

Here is an example:

```
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
 >  Return Value: 194
    --Debug Output Start--
    --Debug Output End--
    Error: unable to fence 'node2'
 >  real    0m20.791s
    user    0m0.063s
    sys     0m0.033s
    [root@node1 ~]#
```

For investigation, I've set up a testing cluster with 2 VirtualBox VMs.

The behaviour was NOT observed on the testing cluster until I intentionally 
added disk write delays with the dmsetup tool on one of the nodes.

Here is an example of setting a 22sec write delay:

```
    # Create: read delay = 0 ms, write delay = 22000 ms
    # Table format: delay <dev> <start> <read_ms>  <dev> <start> <write_ms>
    dmsetup --noudevsync create slow-sdc --table "0 ${SIZE} delay /dev/sdc 0 0 /dev/sdc 0 22000"
    dmsetup mknodes
```
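
For repeated runs with different delays, the table string can be generated rather than typed. The helper below is a hypothetical sketch, not part of the setup described here: it only builds the dm-delay table line in the format shown in the comment above and the matching `dmsetup` argument list, and it does not execute anything. The size of 262144 sectors is the value `blockdev --getsize /dev/sdc` reports later in this message.

```python
def delay_table(size_sectors: int, device: str, read_ms: int, write_ms: int) -> str:
    """Build a dm-delay table line: reads and writes both go to `device`,
    starting at offset 0, with separate delays in milliseconds."""
    return f"0 {size_sectors} delay {device} 0 {read_ms} {device} 0 {write_ms}"


def create_command(name: str, table: str) -> list:
    """Argument list for the `dmsetup create` call shown above (not executed here)."""
    return ["dmsetup", "--noudevsync", "create", name, "--table", table]


table = delay_table(262144, "/dev/sdc", read_ms=0, write_ms=22000)
command = create_command("slow-sdc", table)
print(table)
print(command)
```

Passing the list to `subprocess.run` would avoid shell quoting issues around the table string, but that step is left out on purpose, since it changes block devices and needs root.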

NOTE that tests with delays up to and including 19sec pass:

```
    [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 20000
    [root@node1 ~]# dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 20000
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 194
    --Debug Output Start--
    --Debug Output End--
>   Error: unable to fence 'node2'
>   real    0m20.588s
    user    0m0.088s
    sys     0m0.021s
>   [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 19000
    ++ blockdev --getsize /dev/sdc
    + SIZE=262144
    ++ lsblk -dn -o MAJ:MIN /dev/sdc
    + MAJMIN='  8:32 '
    + dmsetup suspend slow-sdc
    + dmsetup reload slow-sdc --table '0 262144 delay /dev/sdc 0 0 /dev/sdc 0 19000'
    + dmsetup resume slow-sdc
    + dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 19000
    [root@node1 ~]# pcs stonith history cleanup; pcs stonith cleanup # pcs-cleanup-error-cleanup
    cleaning up fencing-history for node *
    Cleaned up all resources on all nodes
    [root@node1 ~]#
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 0
    --Debug Output Start--
    --Debug Output End--
>   Node: node2 fenced
>   real    0m19.869s
    user    0m0.098s
    sys     0m0.035s
    [root@node1 ~]#
```

So here is my question:

I assume there is a 20sec timeout value hardcoded somewhere in the pacemaker 
1.1.23 sources.

This hardcoded value impacts manual fencing in case of disc I/O delays (and 
maybe in some other cases).

I expect that increasing the timeout can mitigate the problem on clusters with 
disc I/O issues similar to the ones described above.

Please note this timeout is NOT stonith-timeout or stonith-watchdog-timeout.

Could you please comment on whether that is a reasonable assumption, and on 
where the 20sec timeout comes from?

Regards, Dmytro

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
