Hi everybody!
I have been using DRBD for many years. Thanks to the developers for such a great tool.
But now I have a problem: a DRBD device blocks I/O completely (100% utilization) for very long periods.
My setup is:
2 nodes (actually 3, but the third is not used for the problem disk), LINSTOR-managed.
The problem VM uses 1 DRBD resource with 2 volumes.
I was already using LVM for my VMs, so for the DRBD pool I created an LV in the same VG: vg_system/lv_drbdpool. The LVs for the DRBD resource are therefore LVs on top of an LV:
  LV                VG        Attr       LSize   Pool Origin   Data%  Meta%  Move Log Cpy%Sync Convert
  vm-apb-oper_00    drbdpool  -wi-ao----  50.05g
  vm-apb-oper_01    drbdpool  -wi-ao---- 200.19g
  lv_drbdpool       vg_system -wi-ao---- 700.00g
 
When I try to sync, the VM's I/O hangs (sometimes for minutes!), although the underlying storage device is mostly idle:
11:17:48 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:17:49 AM    dev8-0     18.00      0.00    296.00     16.44      0.00      0.11      0.11      0.20
11:17:49 AM dev147-100      9.00      0.00     40.00      4.44      0.00      0.00      0.00      0.00
11:17:49 AM dev147-101      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-105      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-102      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-1005      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-114      0.00      0.00      0.00      0.00      9.00      0.00      0.00    100.00
11:17:49 AM dev147-112      8.00      0.00     32.00      4.00      0.00      0.25      0.25      0.20
11:17:49 AM dev147-115      0.00      0.00      0.00      0.00    128.00      0.00      0.00    100.00
11:17:49 AM dev147-113      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-109      1.00      0.00     80.00     80.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-110      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-111      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:17:50 AM    dev8-0     12.00      0.00    144.00     12.00      0.00      0.08      0.08      0.10
11:17:50 AM dev147-100     16.00      0.00     80.00      5.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-101      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-105      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-102      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-1005      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-114      0.00      0.00      0.00      0.00      9.00      0.00      0.00    100.00
11:17:50 AM dev147-112      8.00      0.00     47.00      5.88      0.00      0.12      0.12      0.10
11:17:50 AM dev147-115      0.00      0.00      0.00      0.00    128.00      0.00      0.00    100.00
11:17:50 AM dev147-113      5.00      0.00     17.00      3.40      0.00      0.00      0.00      0.00
11:17:50 AM dev147-109      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-110      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-111      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
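
To make the pattern easier to spot: major 147 is the DRBD block major (see the modinfo alias further down), and two of those devices sit at 100% util with a non-empty queue while completing no requests at all. Here is a minimal Python sketch that picks such rows out of `sar -d` output. The function name `find_stalled` and the "stalled" criteria (tps == 0, avgqu-sz > 0, %util >= 100) are my own assumptions for illustration, and it expects the 12-hour time format shown above.

```python
def find_stalled(sar_text):
    """Return sorted device names that are 100% busy but complete no I/O."""
    stalled = set()
    for line in sar_text.splitlines():
        fields = line.split()
        # Expected columns: time, AM/PM, DEV, tps, rd_sec/s, wr_sec/s,
        # avgrq-sz, avgqu-sz, await, svctm, %util
        if len(fields) != 11 or fields[2] == "DEV":
            continue  # skip blank lines and the repeated header rows
        try:
            tps = float(fields[3])
            queue = float(fields[7])
            util = float(fields[10])
        except ValueError:
            continue  # not a data row
        # Assumed criteria for "stalled": requests queued, none completed
        if tps == 0.0 and queue > 0.0 and util >= 100.0:
            stalled.add(fields[2])
    return sorted(stalled)


sample = """\
11:17:49 AM    dev8-0     18.00      0.00    296.00     16.44      0.00      0.11      0.11      0.20
11:17:49 AM dev147-114      0.00      0.00      0.00      0.00      9.00      0.00      0.00    100.00
11:17:49 AM dev147-112      8.00      0.00     32.00      4.00      0.00      0.25      0.25      0.20
11:17:49 AM dev147-115      0.00      0.00      0.00      0.00    128.00      0.00      0.00    100.00
"""
print(find_stalled(sample))  # ['dev147-114', 'dev147-115']
```

On the samples above this reports dev147-114 and dev147-115 in both intervals, while dev8-0 (the physical disk) stays nearly idle.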
 
When I try to disconnect the resource, the operation times out:
# drbdadm disconnect vm-apb-oper
Command 'drbdsetup disconnect vm-apb-oper 1' did not terminate within 5 seconds
 
The log shows:
Apr 15 11:28:16 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967216
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 365981593.
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 1329964647 because of state change 161760079
Apr 15 11:28:22 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967215
Apr 15 11:28:28 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967214
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 1329964647.
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 2449354269 because of state change 161760079
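
The ko values in these messages look odd to me: they are just below 2^32 and decrease by one with each message. A quick Python check, assuming (I have not verified this in the DRBD source) that the counter is a 32-bit value printed as unsigned; the helper name `as_signed32` is only for illustration:

```python
def as_signed32(value):
    """Interpret an unsigned 32-bit integer as a signed two's-complement value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


# ko values copied from the kernel log above
for ko in (4294967216, 4294967215, 4294967214):
    print(ko, "->", as_signed32(ko))
# 4294967216 -> -80
# 4294967215 -> -81
# 4294967214 -> -82
```

If that assumption holds, the counter has gone below zero, to around -80, and is still counting down.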
 
# modinfo drbd
filename:       /lib/modules/3.10.0-957.5.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias:          block-major-147-*
license:        GPL
version:        9.0.16-1
description:    drbd - Distributed Replicated Block Device v9.0.16-1
author:         Philipp Reisner <p...@linbit.com>, Lars Ellenberg <l...@linbit.com>
retpoline:      Y
rhelversion:    7.6
 
Options set for the resource:
linstor resource-definition drbd-options --max-buffers 8000 --max-epoch-size 8000 --sndbuf-size 0 --congestion-fill 1048576 --congestion-extents 16000 --c-fill-target 1048576 --c-max-rate 16384 --ko-count 200 --read-balancing least-pending --verify-alg sha1 --unset-disk-barrier --unset-disk-flushes --unset-md-flushes --unset-disk-drain --allow-two-primaries no vm-apb-oper
 
I've tried different congestion-* and other parameters for the resource, but it makes no difference.
If I disconnect the resource, the VM works flawlessly.
How can I fix the problem?
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
