Dear DRBD-users,

We are currently performing an upgrade from Proxmox VE 6 to VE 7 on a three-node linstor/drbd cluster. (Only two nodes are storage+compute nodes / satellites; the third is the linstor controller + quorum node.)

This is a testing environment that we built in preparation for the upgrade of the live cluster.

Before starting the upgrade we were on linstor 1.11, drbd-dkms 9.0.27 and pve 6.3. Our upgrade route was to first upgrade linstor to 1.20, then upgrade all nodes to pve 6.4 and drbd 9.2 (9.0.27-1 -> 9.2.0-1).
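For reference, the upgrade steps per node were roughly the following (exact package selection depends on the node's role and assumes the LINBIT proxmox repository; commands from memory):

  apt update
  apt install linstor-satellite linstor-controller linstor-client   # linstor 1.11 -> 1.20
  apt full-upgrade                                                   # pve 6.3 -> 6.4
  apt install drbd-dkms drbd-utils                                   # drbd 9.0.27-1 -> 9.2.0-1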

After a fresh boot of all nodes we were in a good state: healthy cluster, pve6to7 happy, drbd in sync and all packages up-to-date.

We then performed the upgrade of the first node to pve-7, which seemed to go well, and rebooted that node into pve 7.2-11. As we have three active VMs with three disk resources, this triggered a drbd resync.

Two resources came out fine:

drbd1000 Testserver1: Resync done (total 2 sec; paused 0 sec; 104448 K/sec)
drbd1002 Testserver1: Resync done (total 55 sec; paused 0 sec; 92120 K/sec)

The third resource, however, synced about 65% of the outdated data and then stalled (no more sync traffic, no progress in drbdmon).

The kernel message that seems to be relevant here is this:

drbd vm-101-disk-1/0 drbd1001: drbd_set_in_sync: sector=73703424s size=134479872 nonsense!

More kernel logs from the pve7 node can be found here
https://pastebin.com/aGjy7Sgp
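
In case the numbers help, this is just our arithmetic on that message (not a claim about what drbd expects there):

  size   = 134479872 bytes        ≈ 128.25 MiB
  sector = 73703424 * 512 bytes   ≈ 35.1 GiB offset into the device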

So far we have tried rebooting the pve7 node, but it always gets stuck in Inconsistent/SyncTarget (no sync progress percentage shown) and prints the kernel error message "drbd_set_in_sync: sector=73703424s size=134479872 nonsense".
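
(For completeness, we observe that stuck state simply via the standard status commands, e.g. the following; the resource name is the one from the linstor output below:)

  drbdadm status vm-101-disk-1
  drbdsetup status vm-101-disk-1 --verbose --statistics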

The linstor resources are backed by lvm_thin which is backed by a MegaRAID in RAID1 with SSD drives.

I don't know if this is relevant, but the VM in question has at some point in its lifetime been rolled back to a snapshot. (All snapshots were removed prior to the upgrades.)

At that time the rollback did work OK, but we noticed a huge increase in the allocated space on the backing device (IIRC it was equal to the virtual disk size). We have set "discard=on" in proxmox and ran "fstrim" in the VM, which cut down the space usage, but it is not equal on both nodes (the rough commands are sketched below the table):

root@Testserver3:~# linstor resource list-volumes
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ Testserver1 ┊ vm-100-disk-1 ┊ ssd_thin ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 2.28 GiB ┊ InUse ┊ UpToDate ┊
┊ Testserver2 ┊ vm-100-disk-1 ┊ ssd_thin ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 2.50 GiB ┊ Unused ┊ UpToDate ┊
┊ Testserver1 ┊ vm-101-disk-1 ┊ ssd_thin ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 35.38 GiB ┊ InUse ┊ UpToDate ┊
┊ Testserver2 ┊ vm-101-disk-1 ┊ ssd_thin ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 31.05 GiB ┊ Unused ┊ Inconsistent ┊
┊ Testserver1 ┊ vm-102-disk-1 ┊ ssd_thin ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ 7.04 GiB ┊ InUse ┊ UpToDate ┊
┊ Testserver2 ┊ vm-102-disk-1 ┊ ssd_thin ┊ 0 ┊ 1002 ┊ /dev/drbd1002 ┊ 7.04 GiB ┊ Unused ┊ UpToDate ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
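
The discard/fstrim step mentioned above was roughly this (VM ID, disk slot and storage name are placeholders for our actual config):

  # on the proxmox host: enable discard on the VM disk
  qm set 101 --scsi0 <storage>:vm-101-disk-1,discard=on
  # inside the VM: trim mounted filesystems
  fstrim -av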

The linstor-created resource looks like this:
https://pastebin.com/syLADBdC

relevant version numbers:

drbd-dkms: 9.2.0-1
linstor-(controller|satellite): 1.20.0-1
linstor-proxmox: 6.1.0-1
proxmox-ve versions: 6.4-1 (two nodes) and 7.2-1 (one node)
kernel: 5.4.203-1-pve (two nodes) and 5.15.64-1-pve (one node)

Any insight on this would be most welcome. I'll provide more details if you feel something is missing.

thanks and kind regards,
Nils