[ClusterLabs] Q: sbd: Which parameter controls "error: servant_md: slot read failed in servant."?

Ulrich Windl Wed, 16 Feb 2022 06:09:00 -0800

Hi!

While changing some FC cables I noticed that sbd complained 2 seconds after the 
connection went down (even though the device is multi-pathed and the other paths 
were still up).
I am not aware of any sbd parameter set so low that sbd would panic after only 
2 seconds. Which parameter (if any) is responsible for that?


In fact multipath takes up to 5 seconds to adjust paths.

Here are some sample events (sbd-1.5.0+20210720.f4ca41f-3.6.1.x86_64 from 
SLES15 SP3):
Feb 14 13:01:36 h18 kernel: qla2xxx [0000:41:00.0]-500b:3: LOOP DOWN detected (2 7 0 0).
Feb 14 13:01:38 h18 sbd[6621]: /dev/disk/by-id/dm-name-SBD_1-3P2:    error: servant_md: slot read failed in servant.
Feb 14 13:01:38 h18 sbd[6619]: /dev/disk/by-id/dm-name-SBD_1-3P1:    error: servant_md: mbox read failed in servant.
Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P1 is outdated (age: 11)
Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Servant /dev/disk/by-id/dm-name-SBD_1-3P2 is outdated (age: 11)
Feb 14 13:01:40 h18 sbd[6615]:  warning: inquisitor_child: Majority of devices lost - surviving on pacemaker
Feb 14 13:01:42 h18 kernel: sd 3:0:3:2: rejecting I/O to offline device
Feb 14 13:01:42 h18 kernel: blk_update_request: I/O error, dev sdbt, sector 2048 op 0x0:(READ) flags 0x4200 phys_seg 1 prio class 1
Feb 14 13:01:42 h18 kernel: device-mapper: multipath: 254:17: Failing path 68:112.
Feb 14 13:01:42 h18 kernel: sd 3:0:1:2: rejecting I/O to offline device

Most puzzling is the fact that sbd reports a problem 4 seconds before the 
kernel reports an I/O error. I guess sbd "times out" the pending read.
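
For reference, the offsets quoted in this message can be recomputed from the log timestamps. This is only a small sketch; the event labels are my own shorthand and not names used by sbd or the kernel:

```python
from datetime import datetime

# Timestamps copied from the log excerpt (all on Feb 14).
# The dictionary keys are illustrative labels, not sbd or kernel identifiers.
EVENTS = {
    "loop_down": "13:01:36",        # qla2xxx: LOOP DOWN detected
    "sbd_read_failed": "13:01:38",  # servant_md: slot/mbox read failed
    "sbd_outdated": "13:01:40",     # inquisitor_child: Servant ... is outdated
    "kernel_io_error": "13:01:42",  # blk_update_request: I/O error
}

def seconds_between(start, end):
    """Return the number of seconds from event `start` to event `end`."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds())

print(seconds_between("loop_down", "sbd_read_failed"))        # 2
print(seconds_between("sbd_read_failed", "kernel_io_error"))  # 4
print(seconds_between("loop_down", "sbd_outdated"))           # 4
```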

The thing is: the two SBD disks are on different storage systems, each 
connected via two separate FC fabrics, yet sbd still panics when a single cable 
is disconnected from the host.
My guess is that if "surviving on pacemaker" had not happened, the node would 
have been fenced; is that right?

The other thing I wonder about is the "outdated" age:
How can the age be 11 (seconds) when the disk was disconnected only 4 seconds 
earlier?
It seems the age here is "current_time - time_of_last_read" instead of 
"current_time - time_when_read_attempt_started".
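
To make the two interpretations concrete, here is a minimal sketch using the numbers from the log. The function names are hypothetical and this is not sbd's actual code; the assumption that the failing read attempt started no earlier than the LOOP DOWN event is mine, since the log does not show when the attempt began:

```python
# Seconds past 13:01:00, taken from the log excerpt.
LINK_DOWN = 36        # LOOP DOWN detected
OUTDATED_LOGGED = 40  # "Servant ... is outdated (age: 11)"
REPORTED_AGE = 11

def age_since_last_success(now, last_success):
    """Interpretation 1: age counts from the last successful read."""
    return now - last_success

def age_since_attempt_start(now, attempt_start):
    """Interpretation 2: age counts from the start of the pending read attempt."""
    return now - attempt_start

# Under interpretation 1, the reported age implies when the last good read finished.
implied_last_success = OUTDATED_LOGGED - REPORTED_AGE
print(implied_last_success)              # 29, i.e. 13:01:29
print(LINK_DOWN - implied_last_success)  # 7 seconds before the link went down

# Under interpretation 2 (assuming the attempt began at link-down at the earliest),
# the age could not exceed the time since the disconnect.
print(age_since_attempt_start(OUTDATED_LOGGED, LINK_DOWN))  # 4
```

If the first interpretation is what sbd uses, the last successful read would have completed around 13:01:29, well before the cable was pulled.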

Regards,
Ulrich



