Re: [DRBD-user] drbd syncing all resources hangs business app
Hi Roland,

Thank you for your answer and time. I've been looking for the
auto-resync-after-disable parameter without any luck; I'm using this command
to search for it: `linstor controller drbd-options -h`. So it seems that I'll
have to build a resync chain, since my physical resources aren't able to
handle the full sync without affecting the business app.

Thanks and regards,
Ferran

Message from Roland Kammerer on Fri, Sep 15, 2023 at 10:35:

> On Fri, Sep 15, 2023 at 09:53:44AM +0200, Ferran Alchimia wrote:
> > We have 39 defined resources using the same settings. And all these
> > resources are running on the same RAID backed by two physical NVMe SSD
> > drives. We have two combined hosts and a diskless satellite host. The
> > network card between the two hosts is a 1Gb card.
> >
> > I have read the following guide
> > https://kb.linbit.com/tuning-drbds-resync-controller and I think our
> > current installation might have to be tuned in order to avoid those
> > application hangs.
>
> Yes, it is obviously a good idea to check these settings; some defaults
> are a bit outdated for modern systems.
>
> In general there is also a "resync-after" setting with which you can build
> a "resync chain" of resources, so that not all resources start syncing at
> the same time, but one after another in the defined order.
>
> Your setup looks like a Proxmox setup, so you use LINSTOR, and it has a
> "DrbdOptions/auto-resync-after-disable" option. You might want to try
> whether that helps and is good enough for you. Disclaimer: I know it was
> enabled by default in a release and pulled back afterwards, but things
> might have improved enough, at least in such medium-sized scenarios.
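For reference, building such a resync chain by hand means pointing each
resource's disk section at the resource it should wait for. A rough sketch,
assuming the vm-NNN-disk-N naming from this thread and the DRBD 9 syntax
where resync-after takes a resource/volume argument (check drbd.conf(5) for
your version):

    # In vm-100-disk-4's per-resource configuration: only start its
    # resync once vm-100-disk-3, volume 0, has finished syncing.
    resource "vm-100-disk-4" {
        volume 0 {
            disk {
                resync-after "vm-100-disk-3/0";
            }
        }
    }

Chaining each resource after the previous one this way means only one
resource resyncs at a time instead of all 39 at once.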
> Regards, rck
>
> ___
> Star us on GITHUB: https://github.com/LINBIT
> drbd-user mailing list
> drbd-user@lists.linbit.com
> https://lists.linbit.com/mailman/listinfo/drbd-user
Re: [DRBD-user] drbd syncing all resources hangs business app
On Fri, Sep 15, 2023 at 09:53:44AM +0200, Ferran Alchimia wrote:
> We have 39 defined resources using the same settings. And all these
> resources are running on the same RAID backed by two physical NVMe SSD
> drives. We have two combined hosts and a diskless satellite host. The
> network card between the two hosts is a 1Gb card.
>
> I have read the following guide
> https://kb.linbit.com/tuning-drbds-resync-controller and I think our
> current installation might have to be tuned in order to avoid those
> application hangs.

Yes, it is obviously a good idea to check these settings; some defaults
are a bit outdated for modern systems.

In general there is also a "resync-after" setting with which you can build
a "resync chain" of resources, so that not all resources start syncing at
the same time, but one after another in the defined order.

Your setup looks like a Proxmox setup, so you use LINSTOR, and it has a
"DrbdOptions/auto-resync-after-disable" option. You might want to try
whether that helps and is good enough for you. Disclaimer: I know it was
enabled by default in a release and pulled back afterwards, but things
might have improved enough, at least in such medium-sized scenarios.

Regards, rck
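If the "DrbdOptions/auto-resync-after-disable" property is exposed in a
given LINSTOR client version, setting it controller-wide might look like the
following. The command form is an assumption based on LINSTOR's generic
set-property interface, not verified against any particular release:

    # Hypothetical: set the property at controller level so it applies
    # to all resources; "false" would re-enable automatic resync-after
    # chain management.
    linstor controller set-property DrbdOptions/auto-resync-after-disable false

If the property is not listed by `linstor controller drbd-options -h`, the
manual resync-after chain described above remains the fallback.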
[DRBD-user] drbd syncing all resources hangs business app
Good morning,

When DRBD syncs a few resources, everything works fine. But when DRBD needs
to sync all resources (i.e. after a host came back up), it hangs the business
app running on top. All our DRBD configuration settings are defaults; this is
a resource sample:

    resource "vm-100-disk-3" {
        options {
            cpu-mask                   ""; # default
            on-no-data-accessible      io-error; # default
            auto-promote               yes; # default
            peer-ack-window            4096s; # bytes, default
            peer-ack-delay             100; # milliseconds, default
            twopc-timeout              300; # 1/10 seconds, default
            twopc-retry-timeout        1; # 1/10 seconds, default
            auto-promote-timeout       20; # 1/10 seconds, default
            max-io-depth               8000; # default
            quorum                     majority;
            on-no-quorum               io-error;
            quorum-minimum-redundancy  off; # default
            on-suspended-primary-outdated disconnect; # default
        }
        _this_host {
            node-id 0;
            volume 0 {
                device     minor 1017;
                disk       "/dev/vgthc1/vm-100-disk-3_0";
                meta-disk  internal;
                disk {
                    size                       0s; # bytes, default
                    on-io-error                detach; # default
                    disk-barrier               no; # default
                    disk-flushes               yes; # default
                    disk-drain                 yes; # default
                    md-flushes                 yes; # default
                    resync-after               -1; # default
                    al-extents                 1237; # default
                    al-updates                 yes; # default
                    discard-zeroes-if-aligned  yes; # default
                    disable-write-same         no; # default
                    disk-timeout               0; # 1/10 seconds, default
                    read-balancing             prefer-local; # default
                    rs-discard-granularity     1048576; # bytes
                }
            }
        }
        connection {
            _peer_node_id 2;
            path {
                _this_host   ipv4 10.0.7.106:7017;
                _remote_host ipv4 10.100.1.3:7017;
            }
            net {
                transport               ""; # default
                protocol                C; # default
                timeout                 60; # 1/10 seconds, default
                max-epoch-size          2048; # default
                connect-int             10; # seconds, default
                ping-int                10; # seconds, default
                sndbuf-size             0; # bytes, default
                rcvbuf-size             0; # bytes, default
                ko-count                7; # default
                allow-two-primaries     no; # default
                cram-hmac-alg           "sha1";
                shared-secret           "*";
                after-sb-0pri           disconnect; # default
                after-sb-1pri           disconnect; # default
                after-sb-2pri           disconnect; # default
                always-asbp             no; # default
                rr-conflict             disconnect; # default
                ping-timeout            5; # 1/10 seconds, default
                data-integrity-alg      ""; # default
                tcp-cork                yes; # default
                on-congestion           block; # default
                congestion-fill         0s; # bytes, default
                congestion-extents      1237; # default
                csums-alg               ""; # default
                csums-after-crash-only  no; # default
                verify-alg              "crct10dif-pclmul";
                use-rle                 yes; # default
                socket-check-timeout    0; # default
                fencing                 dont-care; # default
                max-buffers             2048; # default
                allow-remote-read       yes; # default
                _name                   "C";
            }
            volume 0 {
                disk {
                    resync-rate     250k; # bytes/second, default
                    c-plan-ahead    20; # 1/10 seconds, default
                    c-delay-target  10; # 1/10 seconds, default
                    c-fill-target   100s; # bytes, default
                    c-max-rate      102400k; # bytes/second, default
                    c-min-rate      250k; # bytes/second, default
                    bitmap          no;
                }
            }
        }
        connection {
            _peer_node_id 1;
            path {
                _this_host   ipv4 10.0.7.106:7017;
                _remote_host ipv4 10.0.7.105:7017;
            }
            net {
                transport       ""; # default
                protocol        C; # default
                timeout         60; # 1/10 seconds, default
                max-epoch-size  2048; # default
                connect-int     10; # seconds, default
                ping-int        10; # seconds, default
                sndbuf-size     0; # bytes, default
                rcvbuf-size     0; # bytes, default
                ko-count