Hi,

I've done some more tests and I am now able to reproduce the 100% CPU usage / infinite loop of corosync/libqb on all the kernels that I tested. Therefore I think this is NOT a regression in the 4.19 kernel.
A suitable workaround for 4.19+ kernels is to run this on startup (e.g. from tmpfiles.d; a sketch of such an entry follows the kernel list below):

  echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features

This makes the 4.19 kernel behave similarly to earlier ones, where high CPU usage did not prevent softirq processing. The difference between these kernels is whether the 100% CPU usage also causes the host to fence or lose network connectivity: with 4.19.x it does, with 4.4.x and 3.10.x it doesn't. For example, I am looking at a 4.4.x kernel right now with 100% CPU usage on 8 out of 16 hosts, running for 2h30m so far.

Kernels I've tried:
* 3.10.0-957.5.1.el7 (default CentOS 7.6)
* kernel-ml-4.20.12-1.el7.elrepo.x86_64 (elrepo)
* v4.20.10 (manual build with 'make binrpm-pkg -j32', using the .config from 4.20.10-100.fc28 with DEBUG_INFO disabled; with DEBUG_INFO enabled, rpm on CentOS 7 failed to install the package)
* various attempted bisects:
  kernel-4.19.16-200.fc28.x86_64.rpm
  kernel-4.20.0-1.fc30.x86_64.rpm
  kernel-4.20.5_00059_g7e66208-4.x86_64.rpm
  kernel-4.20.5_00089_g44486b2-5.x86_64.rpm
  kernel-4.20.8_00026_g1699a0c-7.x86_64.rpm
  kernel-4.20.8_00039_gf76ee75-8.x86_64.rpm
  kernel-4.20.8_00046_g196ffed-9.x86_64.rpm
  kernel-4.20.8_00049_g54dbbb7-10.x86_64.rpm
  kernel-4.20.9-11.x86_64.rpm
  kernel-4.20.9_00001_g90aa9a7-12.x86_64.rpm
  kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima)
  kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master)
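For reference, here is a minimal tmpfiles.d sketch for the workaround mentioned above. The file name is arbitrary, and it assumes debugfs is already mounted on /sys/kernel/debug by the time systemd-tmpfiles-setup.service runs:

  # /etc/tmpfiles.d/sched_features.conf (hypothetical file name)
  w /sys/kernel/debug/sched_features - - - - NO_RT_RUNTIME_SHARE

The 'w' entry type makes systemd-tmpfiles write the argument to the target file at boot, which is equivalent to the echo above.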
The updated repro steps are:

On 19/02/2019 16:26, Edwin Török wrote:
> On 18/02/2019 18:27, Edwin Török wrote:
> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
> (Xen 4.7.6)

2 vCPUs make it a lot easier to reproduce the lost network connectivity/fencing. 1 vCPU reproduces only the high CPU usage (around 95% on all kernels) with no fencing at all, and after a while the CPU usage drops because corosync is able to make progress, so 1 vCPU is not a good reproducer.

> Host is a Dell Poweredge R430, Xeon E5-2630 v3.

Used the exact same host.

>
> On each VM:
> # yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
> # echo mypassword | passwd hacluster --stdin
> # systemctl enable --now pcsd
> # echo xen_wdt >/etc/modules-load.d/watchdog.conf
> # modprobe xen_wdt
> # hostnamectl set-hostname host-<ip-address>
>
> On one host:
> # pcs cluster auth -u hacluster -p xenroot <allips>
> # pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>
>
> # pcs stonith sbd enable
> # pcs cluster enable --all
> # pcs cluster start --all
> # pcs property set no-quorum-policy=freeze
> # pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
> # pcs property set stonith-watchdog-timeout=10s

I also installed spausedd, mentioned earlier; see below for details.

>
> In a loop on this host:
> # while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool -R; done

This loop is not a good enough reproducer; it led me down some wrong paths during git bisection, mistakenly showing a kernel as not triggering the bug when in fact it was. Sometimes the loop shows an increased CPU usage of ~50% on many nodes, but this stops as soon as the loop is stopped. Sometimes the loop shows no noticeable increase in CPU usage at all.

A better reproducer is (a consolidated shell sketch appears near the end of this mail):
* On all hosts run: pcs cluster stop --force
* Wait until they stop (you might have to `kill -9 $(pidof corosync)` if any of them is already stuck)
* Install the new kernel (yum install -y kernel, yum install -y kernel-ml, rpm -Uhv --force kernel-..., etc.)
* rm -f /boot/grub2/grubenv
* sync
* reboot
* Run corosync-quorumtool and crm_mon to check node health

When all nodes have booted and have corosync running, you usually end up in one of these scenarios:
1. The node takes a long time to boot before you get an SSH or VGA/serial console shell. Wait; the boot scripts have to time out, which is likely related to corosync failing to start up.
2. All nodes are quorate and crm_mon shows dlm active on all hosts. The bug failed to reproduce; try again with a reboot (either a hard reboot, or a clean reboot of all hosts in parallel). It is quite rare to end up in this situation.
3. Corosync and crmd are both using a lot of CPU. Wait; crmd usually finishes and you end up in one of the other scenarios.
4. Corosync is using 50% of a core. Wait; it usually either finishes or goes to 100%.
5. Corosync is not running, and `pcs cluster start` fails with 'Denied connection, not ready'. Try 'pcs cluster start' again in a loop until you reach one of the other scenarios.
6. Corosync is running with high CPU usage (100%) on some nodes. If it stays like that for several minutes, the bug was reproduced (whether the node fences or not).

On some kernels the problem reproduces very easily after a hard reboot of all hosts. On other kernels it takes a few more hard (power cycle the VM) or soft (run the 'reboot' command inside the VM) reboot cycles to reproduce. The soft reboot seems to have a higher chance of reproducing the problem.

spausedd: the pauses it logs are not always correlated with the high CPU usage. For example, on one node it hasn't logged anything today:

/var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
/var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s (0.00%)

But:

4399 root rt 0 241380 139072 84732 R 106.7 3.4 183:51.67 corosync

On other nodes it does log problems, but only every once in a while; for example, these are all the logs from today from this other node:

/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Not scheduled for 0.4692s (threshold is 0.2000s), steal time is 0.2800s (59.68%)
/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Not scheduled for 0.4903s (threshold is 0.2000s), steal time is 0.3400s (69.34%)
/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Not scheduled for 0.2840s (threshold is 0.2000s), steal time is 0.1600s (56.34%)
/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Not scheduled for 0.4808s (threshold is 0.2000s), steal time is 0.3700s (76.95%)
/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Not scheduled for 0.2448s (threshold is 0.2000s), steal time is 0.1300s (53.10%)
/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Steal time is > 10.0%, this is usually because of overloaded host machine
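For completeness, here is a rough shell sketch of the "better reproducer" above, driven from a separate machine over ssh. The HOSTS list, the ssh orchestration and the kernel package name are illustrative only, not part of the procedure I actually ran:

  #!/bin/bash
  # Hypothetical list of cluster nodes; substitute the real members.
  HOSTS="host-10 host-11 host-12"

  # Stop the cluster stack on every node. If corosync is already stuck,
  # it may have to be killed by hand with: kill -9 $(pidof corosync)
  for h in $HOSTS; do
      ssh "$h" 'pcs cluster stop --force' &
  done
  wait

  # Install the kernel under test (placeholder package name), remove
  # grubenv, sync and reboot each node.
  for h in $HOSTS; do
      ssh "$h" 'yum install -y kernel && rm -f /boot/grub2/grubenv && sync && reboot' &
  done
  wait

  # Run this part once the nodes accept ssh again: retry 'pcs cluster start'
  # until corosync accepts the connection (this assumes pcs exits non-zero
  # while the connection is still refused), then check quorum and resources.
  for h in $HOSTS; do
      ssh "$h" 'until pcs cluster start; do sleep 5; done; corosync-quorumtool -s; crm_mon -1'
  done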
Best regards,
--Edwin