On 02/20/2019 06:37 PM, Edwin Török wrote:
> On 20/02/2019 13:08, Jan Friesse wrote:
>> Edwin Török wrote:
>>> On 20/02/2019 07:57, Jan Friesse wrote:
>>>> Edwin,
>>>>>
>>>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>>>>>>>> patched [1] SBD) and was not able to reproduce the issue yet.
>>>>>>>> I was able to finally reproduce this using only upstream components
>>>>>>>> (although it seems to be easier to reproduce if we use our patched
>>>>>>>> SBD, I was able to reproduce this by using only upstream packages
>>>>>>>> unpatched by us):
>>>>>> Just out of curiosity: What did you patch in SBD?
>>>>>> Sorry if I missed the answer in the previous communication.
>>>>> It is mostly this PR, which calls getquorate quite often (a more
>>>>> efficient impl. would be to use the quorum notification API like
>>>>> dlm/pacemaker do, although see concerns in
>>>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>>>>
>>>>> https://github.com/ClusterLabs/sbd/pull/27
>>>>>
>>>>> We have also added our own servant for watching the health of our
>>>>> control plane, but that is not relevant to this bug (it reproduces with
>>>>> that watcher turned off too).
>>>>>
>>>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>>>> that showed something interesting:
>>>>>>> https://clbin.com/d76Ha
>>>>>>>
>>>>>>> It is looping on:
>>>>>>> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>>> Hmm ... something like tx-queue of the device full, or no buffers
>>>>>> available anymore and kernel-thread doing the cleanup isn't
>>>>>> scheduled ...
>>>>> Yes, that is very plausible. Perhaps it'd be nicer if corosync went back
>>>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>>>> But this is exactly what happens. Corosync will call sendmsg to all
>>>> active udpu members and returns back to the main loop -> epoll_wait.
>>>>
>>>>> (although this seems different from the original bug where it got stuck
>>>>> in epoll_wait)
>>>> I'm pretty sure it is.
>>>>
>>>> Anyway, let's try the "sched_yield" idea. Could you please try the
>>>> included patch and see if it makes any difference (only for udpu)?
>>> Thanks for the patch, unfortunately corosync still spins at 106% even
>>> with yield:
>>> https://clbin.com/CF64x
>> Yep, it was kind of expected, but at least worth a try. How does strace
>> look when this happens?
>>
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z
>
>> Also Klaus had an idea to try to remove sbd from the picture and try a
>> different RR process to find out what happens. And I think it's again
>> worth a try.
>>
>> Could you please try install/enable/start
>> https://github.com/jfriesse/spausedd (packages built by copr are at
>> https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
>> disable/remove sbd and run your test?
>>
> Tried this with 4 vCPUs but it didn't detect any pauses (kind of
> expected, plenty of other CPUs to have spausedd scheduled on even if
> corosync hogs one).
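For context, what a detector like spausedd checks is simply whether it was
scheduled within its expected interval; a long pause means something (here
the SCHED_RR corosync) kept the CPU. A minimal sketch of that kind of check
in C follows; this is not spausedd's actual code, the 200 ms interval and
100 ms threshold are made up for illustration, and the real tool also runs
at realtime priority (as seen in the top output further down).

/* Sketch of a scheduler-pause detector in the spirit of spausedd
 * (illustrative only, not spausedd's implementation).
 * Build: gcc -O2 -o pause-sketch pause-sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    const long interval_ns  = 200L * 1000000L;   /* sleep 200 ms per iteration */
    const long threshold_ns = 100L * 1000000L;   /* report if we oversleep by >100 ms */
    struct timespec req = { 0, interval_ns };

    for (;;) {
        int64_t before = now_ns();
        nanosleep(&req, NULL);                   /* expect to wake ~interval_ns later */
        int64_t late = now_ns() - before - interval_ns;
        if (late > threshold_ns)
            fprintf(stderr, "not scheduled for an extra %lld ms\n",
                    (long long)(late / 1000000LL));
    }
    return 0;
}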
> Then tried with 1 vCPU and it didn't detect any pauses here either. The
> 95% max realtime runtime protection kicks in and limits corosync to 95%
> CPU, since now global 95% = this CPU's 95%, as there is only one.
> Interestingly corosync stays running at 95% CPU usage now, unlike in SMP
> scenarios where the 95% limit was enough to avoid the situation.
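The "95% max realtime runtime protection" mentioned above is the kernel's
realtime throttling: by default sched_rt_runtime_us is 950000 out of a
sched_rt_period_us of 1000000, i.e. realtime tasks may use at most 95% of
each period. A small sketch (assuming the standard procfs paths; not part
of corosync) to print the budget in effect on a given machine:

/* Print the kernel realtime-throttling budget (the "95%" discussed above). */
#include <stdio.h>

static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0 ||
        period <= 0) {
        fprintf(stderr, "could not read realtime throttling sysctls\n");
        return 1;
    }
    if (runtime < 0)
        printf("realtime throttling disabled (sched_rt_runtime_us = -1)\n");
    else
        printf("realtime tasks may use %.1f%% of every %ld us period\n",
               100.0 * (double)runtime / (double)period, period);
    return 0;
}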
Hmm, maybe the thing that should be scheduled is running at SCHED_RR as
well, but with just a lower prio. So it wouldn't profit from the
sched_yield and it wouldn't get anything of the 5% either.

> There is a visible keyboard lag over SSH, but spausedd does get scheduled:

But it shows at least that the sched_yield is doing something ...
although the RR might be triggered even without the explicit sched_yield
in cases like this.

>
>  4923 root      rt   0  213344 111036  84732 R 93.3  2.8  19:07.50 corosync
>     9 root      20   0       0      0      0 S  0.3  0.0   0:02.02 ksoftirqd/0
>  4256 root      rt   0   88804  35612  34368 R  0.3  0.9   0:04.36 spausedd
>
> No SBD running, and corosync-blackbox does not work at all now.
>
> ifconfig eth0; sleep 1; ifconfig eth0
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
>         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
>         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
>         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
>         RX packets 4929031  bytes 4644924223 (4.3 GiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 15445220  bytes 22485760076 (20.9 GiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>
> eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>         inet 10.62.98.26  netmask 255.255.240.0  broadcast 10.62.111.255
>         inet6 fd06:7768:b9e5:8c50:4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x0<global>
>         inet6 fe80::4c0e:daff:fe14:55f0  prefixlen 64  scopeid 0x20<link>
>         ether 4e:0e:da:14:55:f0  txqueuelen 1000  (Ethernet)
>         RX packets 4934042  bytes 4649642415 (4.3 GiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 15464987  bytes 22514562910 (20.9 GiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>
> Looking at this from the hypervisor side, the VM is not sending many
> packets once CPU usage goes up:
> https://pasteboard.co/I246a1W.png
>
>>> On another host corosync failed to start up completely (Denied
>>> connection not ready), and:
>>> https://clbin.com/Z35Gl
>>> (I don't think this is related to the patch, it was doing that before
>>> when I looked at it this morning, kernel 4.20.0 this time)
>> This one looks kind of normal and I'm pretty sure it's unrelated (I've
>> seen it already, sadly never was able to find a "reliable" reproducer).
> Got a fairly reliable one now: only 3/16 VMs were able to start
> corosync, one got stuck with CPU usage as shown above, and the others
> failed to start with 'Denied connection'.
>
> In the past we've seen this kind of problem when dhclient happened to
> take up and listen on corosync's port, but that's been fixed in recent
> CentOS 7.x releases, and is not the case now.
>
> Anything you'd like me to try to help debug this?
>
> Thanks,
> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
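A rough sketch of the scheduling interaction Klaus describes above, under
the assumption that corosync's send path behaves like a SCHED_RR busy loop:
sched_yield() only moves the caller to the tail of the run queue for its
own priority, so a SCHED_RR task at a lower priority, or a normal
SCHED_OTHER task outside the throttled 5%, still gets no CPU time. The
priority value 99 is illustrative and not necessarily what corosync uses;
running this needs root (or CAP_SYS_NICE) and will pin one CPU while it
loops.

/* Illustrative SCHED_RR busy loop with sched_yield(); not corosync code. */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 99 };  /* illustrative priority */
    long i;

    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return 1;
    }

    for (i = 0; i < 10000000L; i++) {
        /* ... pretend to do work, e.g. retry a failing sendmsg() ... */
        sched_yield();  /* yields only to runnable tasks of the same RT priority */
    }
    return 0;
}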