Hi everybody!
I have been using DRBD for many years. Thanks to the developers for such a great tool.
But now I have a problem: a DRBD device shows 100% I/O utilization for a very long period.
My setup is:
2 nodes (actually 3, but the third is not used for the problem disk), managed by LINSTOR.
The problem VM uses one DRBD resource with 2 volumes.
I already had LVM for my VMs, so for the DRBD pool I created an LV in the same VG: vg_system/lv_drbdpool. So the LV for a DRBD resource is an LV on top of an LV:

  LV             VG        Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert
  vm-apb-oper_00 drbdpool  -wi-ao----  50.05g
  vm-apb-oper_01 drbdpool  -wi-ao---- 200.19g
  lv_drbdpool    vg_system -wi-ao---- 700.00g

11:17:48 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:17:49 AM dev8-0 18.00 0.00 296.00 16.44 0.00 0.11 0.11 0.20
11:17:49 AM dev147-100 9.00 0.00 40.00 4.44 0.00 0.00 0.00 0.00
11:17:49 AM dev147-101 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-105 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-102 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-1005 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-114 0.00 0.00 0.00 0.00 9.00 0.00 0.00 100.00
11:17:49 AM dev147-112 8.00 0.00 32.00 4.00 0.00 0.25 0.25 0.20
11:17:49 AM dev147-115 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00
11:17:49 AM dev147-113 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-109 1.00 0.00 80.00 80.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-110 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:17:50 AM dev8-0 12.00 0.00 144.00 12.00 0.00 0.08 0.08 0.10
11:17:50 AM dev147-100 16.00 0.00 80.00 5.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-101 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-105 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-102 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-1005 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-114 0.00 0.00 0.00 0.00 9.00 0.00 0.00 100.00
11:17:50 AM dev147-112 8.00 0.00 47.00 5.88 0.00 0.12 0.12 0.10
11:17:50 AM dev147-115 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00
11:17:50 AM dev147-113 5.00 0.00 17.00 3.40 0.00 0.00 0.00 0.00
11:17:50 AM dev147-109 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-110 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
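
To make the saturated devices easier to spot in statistics like the ones above, here is a minimal sketch of a filter. The function name busy_drbd_devices is only illustrative. It assumes the 12-hour column layout shown above (time, AM/PM, device, then the counters) and that block major 147 is DRBD, so dev147-114 would correspond to /dev/drbd114; with a 24-hour timestamp the field numbers would shift by one.

```shell
# List DRBD devices that a sar -d style sample reports as fully busy.
# Assumptions: 12-hour timestamps (so the device name is field 3),
# and block major 147 belongs to DRBD (dev147-N -> /dev/drbdN).
busy_drbd_devices() {
    awk '$3 ~ /^dev147-/ && $NF + 0 >= 100 {
        split($3, parts, "-")                  # parts[2] is the minor number
        print "/dev/drbd" parts[2], "avgqu-sz=" $8, "tps=" $4
    }' | sort -u                               # one line per device across samples
}
```

Fed with the two samples above, this should report only minors 114 and 115, both with a non-empty queue and zero completed requests per second. Live data could be piped in the same way, for example from `sar -d 1 2`.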
When I try to disconnect the resource, the operation times out:
# drbdadm disconnect vm-apb-oper
Command 'drbdsetup disconnect vm-apb-oper 1' did not terminate within 5 seconds
The log shows:
Apr 15 11:28:16 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967216
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 365981593.
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 1329964647 because of state change 161760079
Apr 15 11:28:22 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967215
Apr 15 11:28:28 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967214
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 1329964647.
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 2449354269 because of state change 161760079
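
A note on the "ko =" values in these messages: they sit just below 2^32, which looks like a small negative number printed as unsigned. This is only my reading of the log, not something I have confirmed in the DRBD source. The sketch below (the helper name to_signed32 is made up) reinterprets the three values as signed 32-bit integers under that assumption.

```shell
# Reinterpret an unsigned 32-bit value as a signed one.
# Assumption: the "ko =" field is a 32-bit counter printed as unsigned.
to_signed32() {
    local v=$1
    if [ "$v" -ge 2147483648 ]; then
        echo $(( v - 4294967296 ))   # subtract 2^32 to wrap into the negative range
    else
        echo "$v"
    fi
}

# The three values from the log lines above.
for ko in 4294967216 4294967215 4294967214; do
    echo "ko=$ko -> $(to_signed32 "$ko")"
done
```

Under that assumption the values are -80, -81 and -82, so the counter seems to keep decreasing by one with each "sending time expired" message.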
# modinfo drbd
filename: /lib/modules/3.10.0-957.5.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias: block-major-147-*
license: GPL
version: 9.0.16-1
description: drbd - Distributed Replicated Block Device v9.0.16-1
author: Philipp Reisner <p...@linbit.com>, Lars Ellenberg <l...@linbit.com>
retpoline: Y
rhelversion: 7.6
Setup for the resource:
linstor resource-definition drbd-options \
    --max-buffers 8000 --max-epoch-size 8000 --sndbuf-size 0 \
    --congestion-fill 1048576 --congestion-extents 16000 \
    --c-fill-target 1048576 --c-max-rate 16384 \
    --ko-count 200 --read-balancing least-pending --verify-alg sha1 \
    --unset-disk-barrier --unset-disk-flushes --unset-md-flushes --unset-disk-drain \
    --allow-two-primaries no \
    vm-apb-oper
I've tried different congestion-* and other parameters for the resource, but it makes no difference.
If I disconnect the resource, the VM functions flawlessly.
How can I fix the problem?
_______________________________________________ drbd-user mailing list drbd-user@lists.linbit.com http://lists.linbit.com/mailman/listinfo/drbd-user