There is already a bug report on the LINBIT/drbd GitHub repository: issue #26, "Bug in drbd 9.1.5 on CentOS 7", from February 2022. I added an update to that issue noting that the problem persists in 9.1.12 and including device info.

On 8/19/22 04:14, Christoph Böhmwalder wrote:
On 16.08.22 at 20:30, Brent Jensen wrote:
I just had my second DRBD cluster fail after updating to
kmod-drbd90-9.1.8-1 and then upgrading the kernel. I'm not sure whether
the kernel update itself broke things or whether the failure only
surfaced after the reboot. About 2 weeks ago there was an update
(kmod-drbd90-9.1.8-1) from elrepo, which got applied. But then, after a
kernel update, the DRBD metadata was corrupt. Here's the gist of the
error:

This is from an AlmaLinux 8 cluster:

Aug  7 16:41:13 nfs6 kernel: drbd r0: Starting worker thread (from
drbdsetup [3515])
Aug  7 16:41:13 nfs6 kernel: drbd r0 nfs5: Starting sender thread (from
drbdsetup [3519])
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug  7 16:41:13 nfs6 kernel: attempt to access beyond end of
device#012sdb1: rw=6144, want=31250710528, limit=31250706432
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0:
drbd_md_sync_page_io(,31250710520s,READ) failed with error -5
Aug  7 16:41:13 nfs6 kernel: drbd r0/0 drbd0: Error while reading metadata.

This is from a CentOS 7 cluster:
Aug 16 11:04:57 v4 kernel: drbd r0 v3: Starting sender thread (from
drbdsetup [9486])
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: meta-data IO uses: blk-bio
Aug 16 11:04:57 v4 kernel: attempt to access beyond end of device
Aug 16 11:04:57 v4 kernel: sdb1: rw=1072, want=3905945600, limit=3905943552
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0:
drbd_md_sync_page_io(,3905945592s,READ) failed with error -5
Aug 16 11:04:57 v4 kernel: drbd r0/0 drbd0: Error while reading metadata.
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Called drbdadm -c
/etc/drbd.conf -v adjust r0
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Exit code 1
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command output:
drbdsetup new-peer r0 0 --_name=v3 --fencing=resource-only
--protocol=C#012drbdsetup new-path r0 0 ipv4:10.1.4.82:7788
ipv4:10.1.4.81:7788#012drbdmeta 0 v09 /dev/sdb1 internal
apply-al#012drbdsetup attach 0 /dev/sdb1 /dev/sdb1 internal
Aug 16 11:04:57 v4 drbd(drbd0)[9452]: ERROR: r0: Command stderr: 0:
Failure: (118) IO error(s) occurred during initial access to
meta-data.#012#012additional info from kernel:#012Error while reading
metadata.#012#012Command 'drbdsetup attach 0 /dev/sdb1 /dev/sdb1
internal' terminated with exit code 10
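
For what it's worth, in both logs the metadata read ends past the
reported end of the backing partition: want exceeds limit by 4096
sectors (2 MiB) on the AlmaLinux node and by 2048 sectors (1 MiB) on the
CentOS 7 node. To confirm the same mismatch on an affected node, a quick
check using only the device from the logs above (the limit= value should
match what blockdev reports):

  blockdev --getsz /dev/sdb1            # partition size in 512-byte sectors
  echo $((31250710528 - 31250706432))   # AlmaLinux node: 4096 sectors past the end
  echo $((3905945600 - 3905943552))     # CentOS 7 node: 2048 sectors past the end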

Both clusters had been running flawlessly for ~2 years. I was in the
process of building a new DRBD cluster to offload the first one when the
2nd production cluster got a kernel update and ran into the exact same
issue. On the first cluster (rhel8/alma) I deleted the metadata and
tried to resync the data over; however, it failed with the same issue.
I'm in the process of building a new one to fix that broken DRBD
cluster. In the last 15 years of using DRBD I have never run into any
corruption issues. I'm at a loss; I thought the first one was a fluke,
but now I know it's not!

Hello,

thank you for the report.

We have implemented a fix for this[0] which will be released soon (i.e.
very likely within the next week).

If you can easily do so (and if this is a non-production system), it
would be great if you could build DRBD from that commit and verify that
the fix resolves the issue for you.

If not, the obvious workaround is to stay on 9.1.7 for now (or downgrade).
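
A rough sketch of that workaround for the elrepo packages mentioned in
the report (the exact 9.1.7 version string still available in elrepo may
differ, so treat the package specs below as assumptions):

  # AlmaLinux 8
  dnf downgrade 'kmod-drbd90-9.1.7*'
  dnf install python3-dnf-plugin-versionlock && dnf versionlock add kmod-drbd90

  # CentOS 7
  yum downgrade 'kmod-drbd90-9.1.7*'
  yum install yum-plugin-versionlock && yum versionlock add kmod-drbd90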

[0]
https://github.com/LINBIT/drbd/commit/d7d76aad2b95dee098d6052567aa15d1342b1bc4
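
In case it helps, a minimal sketch of building the module from that
commit on a test node (assumes git, make, and the kernel-devel package
for the running kernel are installed; not an official build recipe):

  git clone --recursive https://github.com/LINBIT/drbd.git
  cd drbd
  git checkout d7d76aad2b95dee098d6052567aa15d1342b1bc4
  git submodule update --init
  make && make install && depmod -a
  # with all resources down, swap the module:
  # modprobe -r drbd_transport_tcp drbd && modprobe drbd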
