Edwin,
Hi,

I've done some more tests and I am now able to reproduce the 100% CPU
usage / infinite loop of corosync/libqb on all the kernels that I
tested. Therefore I think this is NOT a regression in the 4.19 kernel.

That's good.


A suitable workaround for 4.19+ kernels is to run this on startup (e.g.
from tmpfiles.d):
echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
This makes the 4.19 kernel behave similarly to earlier ones, where
high CPU usage did not prevent softirq processing.
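For reference, a minimal tmpfiles.d sketch of that workaround (the file
name is just an example, and it assumes debugfs is already mounted at
/sys/kernel/debug by the time systemd-tmpfiles runs):

# /etc/tmpfiles.d/sched_features.conf  (example file name)
# 'w' writes the argument to the target file once at boot
w /sys/kernel/debug/sched_features - - - - NO_RT_RUNTIME_SHARE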

The difference between these kernels is whether the 100% CPU usage also
causes the host to fence or lose network connectivity: with 4.19.x it
does; with 4.4.x and 3.10.x it doesn't. For example, I am currently
looking at a 4.4.x kernel with 100% CPU usage on 8 out of 16 hosts,
which has been running like that for 2h30m now.

Kernels I've tried:
* 3.10.0-957.5.1.el7 (default CentOS 7.6)
* kernel-ml-4.20.12-1.el7.elrepo.x86_64 (elrepo)
* v4.20.10 (manual build with 'make binrpm-pkg -j32', using the .config from
4.20.10-100.fc28 with DEBUG_INFO disabled; with DEBUG_INFO enabled, rpm on
CentOS 7 failed to install the package)
* various attempted bisects:
kernel-4.19.16-200.fc28.x86_64.rpm
kernel-4.20.0-1.fc30.x86_64.rpm
kernel-4.20.5_00059_g7e66208-4.x86_64.rpm
kernel-4.20.5_00089_g44486b2-5.x86_64.rpm
kernel-4.20.8_00026_g1699a0c-7.x86_64.rpm
kernel-4.20.8_00039_gf76ee75-8.x86_64.rpm
kernel-4.20.8_00046_g196ffed-9.x86_64.rpm
kernel-4.20.8_00049_g54dbbb7-10.x86_64.rpm
kernel-4.20.9-11.x86_64.rpm
kernel-4.20.9_00001_g90aa9a7-12.x86_64.rpm
kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima)
kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master)

The updated repro steps are:

On 19/02/2019 16:26, Edwin Török wrote:
> On 18/02/2019 18:27, Edwin Török wrote:
Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
(Xen 4.7.6)

2 vCPUs make it a lot easier to reproduce the lost network
connectivity/fencing.
1 vCPU reproduces only the high CPU usage (around 95% on all kernels)
with no fencing at all, and after a while the CPU usage drops because
corosync is able to make progress, so 1 vCPU is not a good reproducer.

Host is a Dell PowerEdge R430, Xeon E5-2630 v3.

So do I understand correctly that the setup is:
- 10 HW cores
- 16 VMs each with 2 (previously 4) vcpu -> 32 (previously 64) vcpu
?

How much memory (physical memory) does the host have?

Is the problem reproducible when the host is not so heavily overcommitted (let's say 5-10 VMs, each with 2 vcpus)?



Used the exact same host.


On each VM:
# yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
# echo mypassword | passwd hacluster --stdin
# systemctl enable --now pcsd
# echo xen_wdt >/etc/modules-load.d/watchdog.conf
# modprobe xen_wdt
# hostnamectl set-hostname host-<ip-address>

On one host:
# pcs cluster auth -u hacluster -p xenroot <allips>
# pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>

# pcs stonith sbd enable
# pcs cluster enable --all
# pcs cluster start --all
# pcs property set no-quorum-policy=freeze
# pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s
on-fail=fence clone interleave=true ordered=true
# pcs property set stonith-watchdog-timeout=10s

I also installed spausedd mentioned earlier, see below for details.


In a loop on this host:
# while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool
-R; done


This loop is not a good enough reproducer; it led me down some wrong
paths during git bisection, mistakenly showing a kernel as not
triggering the bug when in fact it was.
Sometimes the loop shows an increased CPU usage of ~50% on many nodes,
but when the loop is stopped this stops too. Sometimes the loop shows no
noticeable increase in CPU usage.

A better one is (a consolidated sketch of these steps follows the list):
* On all hosts run:
pcs cluster stop --force
* Wait until they stop (might have to `kill -9 $(pidof corosync)` if any
is already stuck)
* Install new kernel (yum install -y kernel, yum install -y kernel-ml,
rpm -Uhv --force kernel-..., etc.)
* rm -f /boot/grub2/grubenv
* sync
* reboot
* Run corosync-quorumtool and crm_mon to check node health
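The per-node steps above, as a rough single sequence (assuming the stock
'kernel' package; substitute kernel-ml or a local rpm as needed):

# run on every node
pcs cluster stop --force
# if corosync is already stuck and stop hangs: kill -9 $(pidof corosync)
yum install -y kernel    # or kernel-ml, or rpm -Uhv --force kernel-...
rm -f /boot/grub2/grubenv
sync
reboot
# once the node is back up, check node health by hand:
#   corosync-quorumtool; crm_mon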

When all nodes have booted and have corosync running, you usually end up
in one of these scenarios:
1. The node takes a long time to boot before you get an SSH or vga/serial
console shell. Wait; the boot scripts have to time out, likely related to
corosync failing to start up.
2. All nodes are quorate, and crm_mon shows dlm active on all hosts. The
bug failed to reproduce; try again with a reboot (either a hard reboot, or
a clean reboot of all hosts in parallel). It is quite rare to end up in
this situation.
3. Corosync and crmd both use a lot of CPU. Wait; crmd usually finishes
and you end up in one of the other scenarios.
4. Corosync uses 50% of a core. Wait; it usually either finishes or goes
to 100%.
5. Corosync is not running, and `pcs cluster start` fails due to 'Denied
connection, not ready'. Try 'pcs cluster start' again in a loop (see the
sketch after this list) until you reach one of the other scenarios.
6. Corosync runs with high CPU usage (100%) on some nodes. If it stays
like that for several minutes, the bug was reproduced (whether the node
fences or not).
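For scenario 5, a simple retry sketch (assuming pcs returns a non-zero
exit status on the 'Denied connection, not ready' error):

# keep retrying until corosync accepts the start request,
# then check which scenario you are in
until pcs cluster start; do sleep 5; done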

On some kernels the problem reproduces very easily after a hard reboot of
all hosts. On other kernels it takes a few more hard (power cycle the
VM) or soft (run the 'reboot' command inside the VM) cycles to reproduce.
A soft reboot seems to have a higher chance of reproducing the problem.

Have you had a chance to test with "send_join" set to 200 in corosync.conf? It may really help, because the problem you are describing really looks like the result of all nodes starting at the same time and overloading the network at exactly the same time.

We already have a bug for bigger clusters (32 nodes) where setting send_join helps, and it is very likely to become the default (I need to test it and find a "formula" to compute the value based on the token timeout).
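For reference, send_join goes into the totem section of
/etc/corosync/corosync.conf; a sketch, keeping whatever pcs already
generated for the rest of the block:

totem {
    ...                # existing pcs-generated options stay as they are
    send_join: 200     # milliseconds; upper bound of the random delay before sending join messages
}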


spausedd:

The pauses it logs are not always correlated with the high CPU usage.
For example, on one node it hasn't logged anything today:
/var/log/messages-20190224:Feb 20 17:03:17 host-10 spausedd[4239]: Not
scheduled for 0.2225s (threshold is 0.2000s), steal time is 0.0000s (0.00%)
/var/log/messages-20190224:Feb 20 17:03:19 host-10 spausedd[4239]: Not
scheduled for 0.2416s (threshold is 0.2000s), steal time is 0.0000s (0.00%)

But:
4399 root      rt   0  241380 139072  84732 R 106.7  3.4 183:51.67 corosync


High corosync CPU usage doesn't necessarily mean that spausedd doesn't get its required (very small) portion of time. You can check the spausedd source code yourself (https://github.com/jfriesse/spausedd/blob/master/spausedd.c) - it's short (especially if you ignore the HAVE_VMGUESTLIB parts) and I'm pretty sure it works as it should. (That was one of the reasons why I wrote it: the corosync pause detector may be affected by bugs in various components (a cycle in corosync / a bug in libqb), but spausedd shouldn't be.)

On other nodes it does log problems, but only once in a while; for
example, these are all of today's logs from this other node:
/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Not scheduled
for 0.4692s (threshold is 0.2000s), steal time is 0.2800s (59.68%)

/var/log/messages:Feb 25 10:07:45 host-10 spausedd[3707]: Steal time is
10.0%, this is usually because of overloaded host machine

/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Not scheduled
for 0.4903s (threshold is 0.2000s), steal time is 0.3400s (69.34%)

/var/log/messages:Feb 25 10:16:34 host-10 spausedd[3707]: Steal time is
10.0%, this is usually because of overloaded host machine

/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Not scheduled
for 0.2840s (threshold is 0.2000s), steal time is 0.1600s (56.34%)

/var/log/messages:Feb 25 10:42:21 host-10 spausedd[3720]: Steal time is
10.0%, this is usually because of overloaded host machine

/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Not scheduled
for 0.4808s (threshold is 0.2000s), steal time is 0.3700s (76.95%)

/var/log/messages:Feb 25 10:59:00 host-10 spausedd[3729]: Steal time is
10.0%, this is usually because of overloaded host machine

/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Not scheduled
for 0.2448s (threshold is 0.2000s), steal time is 0.1300s (53.10%)

/var/log/messages:Feb 25 11:07:53 host-10 spausedd[3729]: Steal time is
10.0%, this is usually because of overloaded host machine


As stated in the comment above, could you please try to reproduce the problem on a host that is not so overloaded?

Honza

Best regards,
--Edwin


