Re: [DRBD-user] drbd syncing all resources hangs business app

2023-09-15 Thread Ferran Alchimia
Hi Roland,

Thank you for your answer and time. I've been looking for the
auto-resync-after-disable parameter without any luck; I've been searching
for it with `linstor controller drbd-options -h`.
So it seems that I'll have to build a resync chain, since my physical
resources aren't able to handle the full sync without affecting the
business app.
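
Before building the chain by hand, in case it is a controller property
rather than one of the drbd-options, I guess something like the following
might be worth a try (untested on my side; the property name is taken from
your mail, and that setting it to "false" re-enables the automatic chaining
is only my assumption):

    # look for the property and, if present, try turning the feature back on
    linstor controller list-properties | grep -i resync
    linstor controller set-property DrbdOptions/auto-resync-after-disable false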

Thanks and regards,
Ferran

Message from Roland Kammerer on Friday, Sep 15, 2023 at 10:35:

> On Fri, Sep 15, 2023 at 09:53:44AM +0200, Ferran Alchimia wrote:
> > We have 39 defined resources using the same settings. And all these
> > resources are running on the same RAID backed by two physical NVMe SSD
> > drives.
> > We have two combined hosts and a diskless satellite host. The network
> > card between the two hosts is a 1Gb card.
> >
> > I have read the following guide
> > https://kb.linbit.com/tuning-drbds-resync-controller and I think our
> > current installation might have to be tuned in order to avoid those
> > application hangs.
>
> Yes, it is obviously a good idea to check these settings; some defaults
> are a bit outdated for modern systems.
>
> In general there is also a "resync-after" setting with which you can build
> a "resync chain" of resources, so that not all resources start syncing at
> the same time but one after another in the defined order.
>
> Your setup looks like a Proxmox setup, so you are using LINSTOR, which has
> a "DrbdOptions/auto-resync-after-disable" option. You might want to try
> whether that helps and is good enough for you. Disclaimer: I know it was
> enabled by default in one release and pulled back afterwards, but things
> might have improved enough, at least in medium-sized scenarios like this.
>
> Regards, rck
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] drbd syncing all resources hangs business app

2023-09-15 Thread Roland Kammerer
On Fri, Sep 15, 2023 at 09:53:44AM +0200, Ferran Alchimia wrote:
> We have 39 defined resources using the same settings. And all these
> resources are running on the same RAID backed by two physical NVMe SSD
> drives.
> We have two combined hosts and a diskless satellite host. The network card
> between the two hosts is a 1Gb card.
> 
> I have read the following guide
> https://kb.linbit.com/tuning-drbds-resync-controller and I think our
> current installation might have to be tuned in order to avoid those
> application hangs.

Yes, it is obviously a good idea to check these settings; some defaults
are a bit outdated for modern systems.
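
As a rough, untested sketch of the direction only (the numbers are purely
illustrative and need to be sized to your 1Gb link and NVMe latency, as the
KB article you linked describes), the resync controller knobs live in the
disk section, plus max-buffers in the net section:

    resource "vm-100-disk-3" {
        disk {
            c-plan-ahead  20;   # keep the dynamic resync controller enabled
            c-fill-target 1M;   # in-flight resync data to aim for
            c-max-rate    60M;  # cap resync below the 1Gb link so the app keeps headroom
            c-min-rate    10M;  # floor while application I/O competes
        }
        net {
            max-buffers   8192; # the 2048 default is low for NVMe-backed storage
        }
    }

With LINSTOR you would set the equivalents through "linstor controller
drbd-options" (or per resource-definition) rather than editing the generated
configuration by hand.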

In general there is also a "resync-after" setting with which you can build a
"resync chain" of resources, so that not all resources start syncing at the
same time but one after another in the defined order.
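
For example, something along these lines (the resource names are just
placeholders; depending on the DRBD version the value may also need a volume
suffix such as "vm-100-disk-2/0"):

    resource "vm-100-disk-3" {
        disk {
            # only start resyncing this device after vm-100-disk-2 has finished
            resync-after "vm-100-disk-2";
        }
    }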

Your setup looks like a Proxmox setup, so you are using LINSTOR, which has a
"DrbdOptions/auto-resync-after-disable" option. You might want to try whether
that helps and is good enough for you. Disclaimer: I know it was enabled by
default in one release and pulled back afterwards, but things might have
improved enough, at least in medium-sized scenarios like this.
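
Whichever way you end up configuring it, you can check on a node whether a
chain actually made it into the generated resource configuration, e.g.:

    # the dump in your first mail shows "resync-after -1", i.e. no chain yet
    drbdsetup show --show-defaults vm-100-disk-3 | grep resync-after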

Regards, rck
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] drbd syncing all resources hangs business app

2023-09-15 Thread Ferran Alchimia
Good Morning,

When DRBD syncs a few resources everything works fine. But when DRBD needs
to sync all resources (e.g. after a host comes back up), it hangs the
business app running on top.

All our DRBD configuration settings are at their defaults; this is a sample
resource:

resource "vm-100-disk-3" {
options {
cpu-mask""; # default
on-no-data-accessible   io-error; # default
auto-promoteyes; # default
peer-ack-window 4096s; # bytes, default
peer-ack-delay  100; # milliseconds, default
twopc-timeout   300; # 1/10 seconds, default
twopc-retry-timeout 1; # 1/10 seconds, default
auto-promote-timeout20; # 1/10 seconds, default
max-io-depth8000; # default
quorum  majority;
on-no-quorumio-error;
quorum-minimum-redundancy   off; # default
on-suspended-primary-outdated   disconnect; # default
}
_this_host {
node-id 0;
volume 0 {
device  minor 1017;
disk"/dev/vgthc1/vm-100-disk-3_0";
meta-disk   internal;
disk {
size0s; # bytes, default
on-io-error detach; # default
disk-barrierno; # default
disk-flushesyes; # default
disk-drain  yes; # default
md-flushes  yes; # default
resync-after-1; # default
al-extents  1237; # default
al-updates  yes; # default
discard-zeroes-if-aligned   yes; # default
disable-write-same  no; # default
disk-timeout0; # 1/10 seconds, default
read-balancing  prefer-local; # default
rs-discard-granularity  1048576; # bytes
}
}
}
connection {
_peer_node_id 2;
path {
_this_host ipv4 10.0.7.106:7017;
_remote_host ipv4 10.100.1.3:7017;
}
net {
transport   ""; # default
protocolC; # default
timeout 60; # 1/10 seconds, default
max-epoch-size  2048; # default
connect-int 10; # seconds, default
ping-int10; # seconds, default
sndbuf-size 0; # bytes, default
rcvbuf-size 0; # bytes, default
ko-count7; # default
allow-two-primaries no; # default
cram-hmac-alg   "sha1";
shared-secret   "*";
after-sb-0pri   disconnect; # default
after-sb-1pri   disconnect; # default
after-sb-2pri   disconnect; # default
always-asbp no; # default
rr-conflict disconnect; # default
ping-timeout5; # 1/10 seconds, default
data-integrity-alg  ""; # default
tcp-corkyes; # default
on-congestion   block; # default
congestion-fill 0s; # bytes, default
congestion-extents  1237; # default
csums-alg   ""; # default
csums-after-crash-only  no; # default
verify-alg  "crct10dif-pclmul";
use-rle yes; # default
socket-check-timeout0; # default
fencing dont-care; # default
max-buffers 2048; # default
allow-remote-read   yes; # default
_name   "C";
}
volume 0 {
disk {
resync-rate 250k; # bytes/second, default
c-plan-ahead20; # 1/10 seconds, default
c-delay-target  10; # 1/10 seconds, default
c-fill-target   100s; # bytes, default
c-max-rate  102400k; # bytes/second, default
c-min-rate  250k; # bytes/second, default
bitmap  no;
}
}
}
connection {
_peer_node_id 1;
path {
_this_host ipv4 10.0.7.106:7017;
_remote_host ipv4 10.0.7.105:7017;
}
net {
transport   ""; # default
protocolC; # default
timeout 60; # 1/10 seconds, default
max-epoch-size  2048; # default
connect-int 10; # seconds, default
ping-int10; # seconds, default
sndbuf-size 0; # bytes, default
rcvbuf-size 0; # bytes, default
ko-count