Hi all,
I need some guidance on Pacemaker 1.1.23.
I'm chasing a stubborn issue in a 2-node, 2-disk SBD cluster.
When I run a manual fencing test with the `pcs stonith fence` command, I get this
error:
```
Error: unable to fence '<nodehostname>'
```
The error appears each time after roughly 20 seconds, which I assume is a
timeout.
I use the `time` command to measure how long the run takes:
`time pcs stonith fence`.
Here is an example:
```
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 194
--Debug Output Start--
--Debug Output End--
Error: unable to fence 'node2'
real 0m20.791s
user 0m0.063s
sys 0m0.033s
[root@node1 ~]#
```
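One possible reading of the return value 194, which I have not verified against the sources: if `stonith_admin` exits with a negative errno-style code, the shell sees it truncated to 8 bits, so 194 would correspond to -62. On Linux, errno 62 is ETIME ("Timer expired"), which would be consistent with a timeout. The arithmetic, as a small sketch:

```shell
# Assumption (not verified in the Pacemaker sources): the exit status is a
# negative errno-style return code truncated to 8 bits by the shell.
status=194                    # "Return Value" reported by pcs --debug
code=$(( status - 256 ))      # candidate signed value behind the exit status
echo "signed return code: ${code}"
echo "errno candidate: $(( -code ))"
```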
To investigate, I've set up a test cluster with two VirtualBox VMs.
The behaviour was NOT observed on the test cluster until I intentionally added
disk write delays with the dmsetup tool on one of the nodes.
Here is an example of setting a 22-second write delay:
```
# Create: read delay = 0 ms, write delay = 22000 ms
# Table format: delay <dev> <start> <read_ms> <dev> <start> <write_ms>
dmsetup --noudevsync create slow-sdc --table "0 ${SIZE} delay /dev/sdc 0 0 /dev/sdc 0 22000"
dmsetup mknodes
```
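To change the delay on the existing device between tests, I use a small helper script. Below is a minimal sketch of what it does, following the suspend/reload/resume sequence visible in the trace output further down. The function names `build_delay_table` and `set_write_delay` are illustrative only, and the device name `slow-sdc` and backing disk `/dev/sdc` are the ones from my setup:

```shell
#!/usr/bin/env bash
# Sketch of a helper that changes the write delay of an existing dm-delay
# mapping. Function names are illustrative, not taken from any tool.

# Build a dm-delay table line: read delay 0 ms, write delay <write_ms> ms.
build_delay_table() {
    local size_sectors=$1 dev=$2 write_ms=$3
    printf '0 %s delay %s 0 0 %s 0 %s\n' \
        "$size_sectors" "$dev" "$dev" "$write_ms"
}

# Apply a new write delay to an existing mapping (needs root and dmsetup).
set_write_delay() {
    local name=$1 dev=$2 write_ms=$3 size rc
    size=$(blockdev --getsize "$dev") || return 1   # size in 512-byte sectors
    dmsetup suspend "$name" || return 1
    dmsetup reload "$name" --table "$(build_delay_table "$size" "$dev" "$write_ms")"
    rc=$?
    # Resume even if the reload failed, so the device is not left suspended.
    dmsetup resume "$name" || return 1
    [ "$rc" -eq 0 ] || return "$rc"
    dmsetup table "$name"
}

# Example (as root): set_write_delay slow-sdc /dev/sdc 19000
```

Nothing runs until `set_write_delay` is called explicitly, so the file can be sourced safely on a machine without the mapping.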
Note that tests with write delays up to and including 19 seconds pass, while a 20-second delay fails:
```
[root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 20000
[root@node1 ~]# dmsetup table slow-sdc
0 262144 delay 8:32 0 0 8:32 0 20000
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 194
--Debug Output Start--
--Debug Output End--
Error: unable to fence 'node2'
real 0m20.588s
user 0m0.088s
sys 0m0.021s
[root@node1 ~]# ./suspend-resume-slow-sdc-delay-write.sh 19000
++ blockdev --getsize /dev/sdc
+ SIZE=262144
++ lsblk -dn -o MAJ:MIN /dev/sdc
+ MAJMIN=' 8:32 '
+ dmsetup suspend slow-sdc
+ dmsetup reload slow-sdc --table '0 262144 delay /dev/sdc 0 0 /dev/sdc 0 19000'
+ dmsetup resume slow-sdc
+ dmsetup table slow-sdc
0 262144 delay 8:32 0 0 8:32 0 19000
[root@node1 ~]# pcs stonith history cleanup; pcs stonith cleanup # pcs-cleanup-error-cleanup
cleaning up fencing-history for node *
Cleaned up all resources on all nodes
[root@node1 ~]#
[root@node1 ~]# time pcs stonith fence --debug node2
Running: /usr/sbin/stonith_admin -B node2
Return Value: 0
--Debug Output Start--
--Debug Output End--
Node: node2 fenced
real 0m19.869s
user 0m0.098s
sys 0m0.035s
[root@node1 ~]#
```
So here is my question:
I assume there is a 20-second timeout hardcoded somewhere in the Pacemaker 1.1.23
sources.
This hardcoded value affects manual fencing when disk I/O is delayed (and maybe
in some other cases).
I expect that increasing the timeout would help clusters with disk I/O issues
similar to the ones described above.
Please note that this timeout is NOT stonith-timeout or stonith-watchdog-timeout.
Could you please comment on whether this is a reasonable assumption, and where
the 20-second timeout comes from?
Regards, Dmytro