On 02/19/2019 05:41 PM, Edwin Török wrote:
> On 19/02/2019 16:26, Edwin Török wrote:
>> On 18/02/2019 18:27, Edwin Török wrote:
>>> Did a test today on CentOS 7.6 with an upstream kernel
>>> (4.20.10-1.el7.elrepo.x86_64), tested both with upstream SBD and our
>>> patched [1] SBD, and was not able to reproduce the issue yet.
>> I was finally able to reproduce this using only upstream components
>> (it seems to be easier to reproduce with our patched SBD, but I could
>> also reproduce it using only upstream packages, unpatched by us):
Just out of curiosity: what did you patch in SBD?
Sorry if I missed the answer in the previous communication.

> I was also able to get a corosync blackbox from one of the stuck VMs
> that showed something interesting:
> https://clbin.com/d76Ha
>
> It is looping on:
> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
> (non-critical): Resource temporarily unavailable (11)

Hmm ... something like the tx queue of the device being full, or no
buffers available anymore and the kernel thread doing the cleanup not
getting scheduled ... (a small sketch of the retry pattern I have in
mind is below, after the quoted setup).
Does the kernel log anything in that situation?

> Also noticed this:
> [ 5390.361861] crmd[12620]: segfault at 0 ip 00007f221c5e03b1 sp
> 00007ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
> 19 <f3> 0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0

>> CentOS 7.6 vmlinuz-3.10.0-957.el7.x86_64: OK
>> CentOS 7.6 vmlinuz-4.19.16-200.fc28.x86_64: 100% CPU usage corosync
>> CentOS 7.6 vmlinuz-4.19-xen (XenServer): 100% CPU usage corosync
>> CentOS 7.6 vmlinuz-4.20.10-1.el7.elrepo.x86_64: OK
>>
>> I got the 4.19.16 kernel from:
>> https://koji.fedoraproject.org/koji/buildinfo?buildID=1180301
>>
>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4 GiB RAM, running on XenServer 7.6
>> (Xen 4.7.6).
>> The host is a Dell PowerEdge R430 with a Xeon E5-2630 v3.
>>
>> On each VM:
>> # yum install -y corosync dlm pcs pacemaker fence-agents-all sbd
>> # echo mypassword | passwd hacluster --stdin
>> # systemctl enable --now pcsd
>> # echo xen_wdt >/etc/modules-load.d/watchdog.conf
>> # modprobe xen_wdt
>> # hostnamectl set-hostname host-<ip-address>
>>
>> On one host:
>> # pcs cluster auth -u hacluster -p mypassword <allips>
>> # pcs cluster setup --name cluster --auto_tie_breaker=1 <allips>
>>
>> # pcs stonith sbd enable
>> # pcs cluster enable --all
>> # pcs cluster start --all
>> # pcs property set no-quorum-policy=freeze
>> # pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s \
>>     on-fail=fence clone interleave=true ordered=true
>> # pcs property set stonith-watchdog-timeout=10s
>>
>> In a loop on this host:
>> # while true; do pcs cluster stop; pcs cluster start; corosync-cfgtool -R; done
>>
>> # rpm -q corosync pacemaker sbd libqb
>> corosync-2.4.3-4.el7.x86_64
>> pacemaker-1.1.19-8.el7.x86_64
>> sbd-1.3.1-8.2.el7.x86_64
>> libqb-1.0.1-7.el7.x86_64
>>
>> Watch the other VMs: if the bug happens you will lose SSH, see
>> corosync using 100% CPU, or simply notice that the pane with that VM
>> is no longer updating.
>> For watching the other VMs I used this script inside tmux with
>> 'setw synchronize-panes on':
>> https://github.com/xapi-project/testarossa/blob/master/scripts/tmuxmulti.sh
>> # scripts/tmuxmulti.sh 'ssh root@{}' 10.62.98.34 10.62.98.38 10.62.98.23
>> 10.62.98.30 10.62.98.40 10.62.98.36 10.62.98.29 10.62.98.35 10.62.98.28
>> 10.62.98.37 10.62.98.27 10.62.98.39 10.62.98.32 10.62.98.26 10.62.98.31
>> 10.62.98.33
>>
>> Some VMs fence and some just lock up without fencing (I think it
>> depends on how many VMs lock up: if too many do, the remaining ones
>> lose quorum and fence correctly).
>>
>> Another observation: after reproducing the problem, even if I stop the
>> pcs cluster start/stop loop and reboot all VMs, they sometimes still
>> end up in the bad 100% CPU usage state.
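Coming back to the sendmsg(mcast) "Resource temporarily unavailable"
loop in the blackbox: here is a minimal, stand-alone sketch of the
retry pattern I mean. This is not corosync source; the multicast group,
port, payload size and the 1-second poll timeout are placeholders for
illustration only. The point is just that when a non-blocking
sendmsg() keeps failing with EAGAIN (send buffer / device tx queue
full), retrying immediately in a tight loop looks exactly like one
process pinned at 100% CPU, whereas waiting for POLLOUT gives the
kernel a chance to drain the queue first.

/*
 * Minimal sketch, not corosync code: handling EAGAIN from a
 * non-blocking sendmsg().  Multicast group, port, payload size and
 * the poll timeout are placeholders.
 */
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Retry a failed sendmsg(), but only after the socket signals POLLOUT,
 * so a full send buffer does not turn into a 100% CPU busy loop. */
static int send_with_backoff(int fd, struct msghdr *msg)
{
    for (;;) {
        ssize_t n = sendmsg(fd, msg, MSG_DONTWAIT);
        if (n >= 0)
            return 0;                   /* datagram handed to the kernel */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* real error, give up */

        /* Send buffer full: wait (up to 1s) until the socket is
         * writable again instead of retrying immediately. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, 1000) < 0 && errno != EINTR)
            return -1;
    }
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }
    fcntl(fd, F_SETFL, O_NONBLOCK);     /* sendmsg() may now fail with EAGAIN */

    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port   = htons(5405) };   /* placeholder port */
    inet_pton(AF_INET, "239.192.0.1", &dst.sin_addr);         /* placeholder group */

    char payload[1200] = { 0 };
    struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
    struct msghdr msg = { .msg_name = &dst, .msg_namelen = sizeof(dst),
                          .msg_iov = &iov, .msg_iovlen = 1 };

    if (send_with_backoff(fd, &msg) != 0)
        perror("sendmsg");
    close(fd);
    return 0;
}

If the sender ends up in the "retry immediately" variant because the
socket never becomes writable again on the affected 4.19 kernels, that
would match both the flood of non-critical sendmsg failures in the
blackbox and the top output quoted further down.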
>> P.S.: taking a disk+memory snapshot of the VM is also enough to get
>> corosync out of the bad state; when the VM is resumed, its CPU usage
>> goes down to 0.3%.
>>
>> Here is what a frozen VM looks like (logged in via serial using `xl
>> console`):
>>
>> top - 16:25:23 up  1:25,  3 users,  load average: 3.86, 1.96, 0.78
>> Tasks: 133 total,   4 running,  68 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 11.1 us, 14.3 sy,  0.0 ni, 49.2 id, 22.2 wa,  1.6 hi,  1.6 si,  0.0 st
>> KiB Mem :  4005228 total,  3507244 free,   264960 used,   233024 buff/cache
>> KiB Swap:  1048572 total,  1048572 free,        0 used.  3452980 avail Mem
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>  4975 root      rt   0  216460 114152  84732 R 100.0  2.9   4:08.29 corosync
>>     1 root      20   0  191036   5300   3884 S   0.0  0.1   0:02.14 systemd
>>     2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
>>     3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
>>     4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_par_gp
>>     6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0+
>>     7 root      20   0       0      0      0 I   0.0  0.0   0:00.00 kworker/u+
>>     8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu+
>>     9 root      20   0       0      0      0 S   0.0  0.0   0:00.01 ksoftirqd+
>>    10 root      20   0       0      0      0 I   0.0  0.0   0:01.00 rcu_sched
>>    11 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
>>    12 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration+
>>    14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
>>    15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
>>    16 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 migration+
>>    17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd+
>>    19 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/1+
>>
>> [root@host-10 ~]# uname -a
>> Linux host-10.62.98.36 4.19.16-200.fc28.x86_64 #1 SMP Thu Jan 17
>> 00:16:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Best regards,
>> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org