I also recently ran into this issue: a couple of nodes lost network connectivity and showed various messages like these:

  "NMI watchdog: Watchdog detected hard"
  "watchdog: BUG: soft lockup - CPU#xx stuck for 22s!"

I don't know how to collect stack traces, but these are the versions I was running:

root@vmhost1:~# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.3-12 (running version: 5.3-12/5fbbbaf6)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-48
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-12
libpve-storage-perl: 5.0-39
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
proxmox-widget-toolkit: 1.0-24
pve-cluster: 5.0-34
pve-container: 2.0-35
pve-docs: 5.3-3
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-18
pve-firmware: 2.0-6
pve-ha-manager: 2.0-8
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-2
pve-xtermjs: 3.10.1-2
qemu-server: 5.0-47
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

and

drbd-dkms/unknown,now 9.0.17-1
drbd-utils/unknown,now 9.8.0-1
linstor-common/unknown,now 0.9.4-1
linstor-proxmox/unknown,now 3.0.3-1
linstor-satellite/unknown,now 0.9.4-1
python-linstor/unknown,now 0.9.1-1

One of my nodes would lock up shortly after it started synchronizing, and it did so repeatedly. Since I downgraded drbd-dkms to 9.0.14-1, my cluster has been stable again.

Just an FYI.

Thanks
_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user