[ClusterLabs] pacemaker: 1.1.23 20sec timeout on cluster with disc I/O write delays

Dmytro Poliarush via Users Tue, 17 Mar 2026 04:32:54 -0700

Hi all,
I need some guidance on pacemaker 1.1.23.
I'm chasing a stubborn issue in a two-node, two-disk SBD cluster.


When running a manual fencing test with the `pcs stonith fence` command, I observe an error:
```
    Error: unable to fence '<nodehostname>'
```
The error appears each time after about 20 seconds (I assume this is a timeout).
I use the `time` command to track how long execution runs: `time pcs stonith fence`.
Here is an example:
```
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
 >  Return Value: 194
    --Debug Output Start--
    --Debug Output End--

    Error: unable to fence 'node2'

 >  real    0m20.791s
    user    0m0.063s
    sys     0m0.033s
    [root@node1 ~]#
```
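
One possible reading of the return value, as an aside: if `stonith_admin` exits with a negative errno-style code (an assumption about pacemaker 1.1, not something confirmed here), the shell only sees the low 8 bits, so 194 would correspond to -62, and errno 62 on Linux is ETIME ("Timer expired"). A minimal sketch of that arithmetic, where `decode_exit_status` is a hypothetical helper name:

```shell
# Hypothetical helper: map an 8-bit process exit status back to the signed
# value a program may have passed to exit(). Values above 127 are treated as
# negative numbers truncated to 8 bits. This is only an interpretation aid.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 127 ]; then
        echo $((status - 256))
    else
        echo "$status"
    fi
}

decode_exit_status 194   # prints -62
```

If that assumption holds, the failure is reported as a timer expiry rather than as a device error, which would fit the 20-second pattern.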

To investigate, I've set up a test cluster with two VirtualBox VMs.
The behaviour was NOT observed on the test cluster until I intentionally added disk write delays with the dmsetup tool on one of the nodes.
Here is an example of setting a 22-second write delay:
```
    # Create: read delay = 0 ms, write delay = 22000 ms
    # Table format: delay <dev> <start> <read_ms>  <dev> <start> <write_ms>
    dmsetup --noudevsync create slow-sdc --table "0 ${SIZE} delay /dev/sdc 0 0 /dev/sdc 0 22000"
    dmsetup mknodes
```
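
For completeness, here is a small sketch of how the table string for the delay target can be assembled. `delay_table` is a hypothetical helper name, and the size of 262144 sectors is taken from the `dmsetup table` output further down in this message; on a real system `SIZE` would come from `blockdev --getsize /dev/sdc`.

```shell
# Hypothetical helper: print a device-mapper "delay" table line.
# Usage: delay_table <backing-dev> <size-in-sectors> <read-ms> <write-ms>
# The same backing device is used for both the read and the write path.
delay_table() {
    printf '0 %s delay %s 0 %s %s 0 %s\n' "$2" "$1" "$3" "$1" "$4"
}

delay_table /dev/sdc 262144 0 22000
# prints: 0 262144 delay /dev/sdc 0 0 /dev/sdc 0 22000

# As root, the device could then be created with something like:
#   SIZE=$(blockdev --getsize /dev/sdc)
#   dmsetup --noudevsync create slow-sdc --table "$(delay_table /dev/sdc "$SIZE" 0 22000)"
```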

Note that tests with write delays of up to and including 19 seconds pass, while a 20-second delay fails:
```
    [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 20000
    [root@node1 ~]# dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 20000
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 194
    --Debug Output Start--
    --Debug Output End--

>   Error: unable to fence 'node2'

>   real    0m20.588s
    user    0m0.088s
    sys     0m0.021s

>   [root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 19000
    ++ blockdev --getsize /dev/sdc
    + SIZE=262144
    ++ lsblk -dn -o MAJ:MIN /dev/sdc
    + MAJMIN='  8:32 '
    + dmsetup suspend slow-sdc
    + dmsetup reload slow-sdc --table '0 262144 delay /dev/sdc 0 0 /dev/sdc 0 19000'
    + dmsetup resume slow-sdc
    + dmsetup table slow-sdc
>   0 262144 delay 8:32 0 0 8:32 0 19000
    [root@node1 ~]# pcs stonith history cleanup; pcs stonith cleanup # pcs-cleanup-error-cleanup
    cleaning up fencing-history for node *

    Cleaned up all resources on all nodes
    [root@node1 ~]#
    [root@node1 ~]# time pcs stonith fence --debug node2
    Running: /usr/sbin/stonith_admin -B node2
    Return Value: 0
    --Debug Output Start--
    --Debug Output End--

>   Node: node2 fenced

>   real    0m19.869s
    user    0m0.098s
    sys     0m0.035s
    [root@node1 ~]#
```
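
The `suspend-resume-slow-sdc-delay-write.sh` script itself is not included in this message. Judging only from the trace above, it does roughly the following; this is a reconstruction, not the actual script. `set_write_delay`, `run` and the `DRY_RUN` switch are hypothetical names added so the steps can be inspected without root privileges.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_write_delay <mapped-name> <backing-dev> <size-in-sectors> <write-delay-ms>
# Suspends the mapped device, loads a table with the new write delay
# (read delay stays 0), resumes the device and prints the active table.
set_write_delay() {
    name=$1
    dev=$2
    size=$3
    delay_ms=$4
    run dmsetup suspend "$name" &&
    run dmsetup reload "$name" --table "0 $size delay $dev 0 0 $dev 0 $delay_ms" &&
    run dmsetup resume "$name" &&
    run dmsetup table "$name"
}

# Example (as root):
#   set_write_delay slow-sdc /dev/sdc "$(blockdev --getsize /dev/sdc)" 19000
```

With `DRY_RUN=1` the function only prints the four `dmsetup` commands, which makes it easy to compare them against the trace.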

So here is my question:
I assume there is a 20-second timeout value hardcoded somewhere in the pacemaker 1.1.23 sources.
This hardcoded value affects manual fencing when disk I/O is delayed (and maybe in some other cases).
I expect that increasing this timeout would mitigate the problem on clusters with disk I/O issues similar to the ones described above.
Please note that this timeout is NOT stonith-timeout or stonith-watchdog-timeout.

Could you please comment on whether this is a reasonable assumption, and where the 20-second timeout comes from?

Regards, Dmytro

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
