On 20/02/2019 07:57, Jan Friesse wrote:
> Edwin,
>>
>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>> Did a test today with CentOS 7.6 with an upstream kernel,
>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD and our
>>>>>> patched [1] SBD), and was not able to reproduce the issue yet.
>>>>>
>>>>> I was finally able to reproduce this using only upstream components
>>>>> (it seems easier to reproduce with our patched SBD, but I reproduced
>>>>> it using only upstream packages, unpatched by us):
>>>
>>> Just out of curiosity: what did you patch in SBD?
>>> Sorry if I missed the answer in the previous communication.
>>
>> It is mostly this PR, which calls getquorate quite often (a more
>> efficient implementation would be to use the quorum notification API
>> as dlm/pacemaker do, although see the concerns in
>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>> https://github.com/ClusterLabs/sbd/pull/27
>>
>> We have also added our own servant for watching the health of our
>> control plane, but that is not relevant to this bug (it reproduces
>> with that watcher turned off too).
>>
>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>> that showed something interesting:
>>>> https://clbin.com/d76Ha
>>>>
>>>> It is looping on:
>>>> debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>> (non-critical): Resource temporarily unavailable (11)
>>>
>>> Hmm ... something like the device's tx queue being full, or no buffers
>>> available anymore and the kernel thread doing the cleanup not getting
>>> scheduled ...
>>
>> Yes, that is very plausible. Perhaps it would be nicer if corosync went
>> back to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>
> But this is exactly what happens. Corosync will call sendmsg for all
> active udpu members and then return to the main loop -> epoll_wait.
>
>> (although this seems different from the original bug, where it got
>> stuck in epoll_wait)
>
> I'm pretty sure it is.
>
> Anyway, let's try the "sched_yield" idea. Could you please try the
> included patch and see if it makes any difference (only for udpu)?
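For anyone reading the archive without the attachment: the patch itself
is not reproduced below, but the idea is roughly to yield the CPU when
sendmsg returns EAGAIN on the udpu send path, so that a SCHED_RR
corosync does not starve ksoftirqd on the same core. A minimal sketch of
that idea, with hypothetical names, and not Honza's actual patch:

    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/socket.h>

    /*
     * Sketch only: send one datagram per configured member; on EAGAIN,
     * yield so lower-priority kernel threads (e.g. ksoftirqd) can run
     * and drain the queue, then carry on. The caller returns to the
     * epoll_wait()-based main loop afterwards. Names are illustrative,
     * not corosync's actual internals.
     */
    static void udpu_send_all(int fd, const struct msghdr *msgs,
                              size_t n_members)
    {
        for (size_t i = 0; i < n_members; i++) {
            ssize_t res = sendmsg(fd, &msgs[i], MSG_NOSIGNAL);
            if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                sched_yield();  /* give ksoftirqd a chance to run */
                continue;       /* non-critical, as in the log above */
            }
            if (res < 0)
                perror("sendmsg(mcast) failed");
        }
    }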
Thanks for the patch; unfortunately corosync still spins at 106% CPU
even with the yield: https://clbin.com/CF64x

On another host corosync failed to start up completely ("Denied
connection not ready"): https://clbin.com/Z35Gl
(I don't think this is related to the patch; it was already doing that
when I looked at it this morning, on kernel 4.20.0 this time.)

Best regards,
--Edwin

> Regards,
>   Honza
>
>>> Does the kernel log anything in that situation?
>>
>> Other than the crmd segfault, no.
>> From previous observations on xenserver, the softirqs were all stuck
>> on the CPU that corosync hogged at 100% (I'll check this on upstream,
>> but I'm fairly sure it will be the same). softirqs do not run at
>> realtime priority (if we raise the priority of ksoftirqd to realtime,
>> everything gets unstuck), yet they seem to be essential for whatever
>> corosync is stuck waiting on, in this case likely the sending and
>> receiving of network packets.
>>
>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to
>> see why this was only reproducible on 4.19 so far.
>>
>> Best regards,
>> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
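For reference, the notification-based alternative to polling getquorate
mentioned earlier in the thread (the approach dlm and pacemaker take)
would look roughly like the sketch below, using corosync 2.x libquorum
and linking with -lquorum. This is a minimal illustration, not SBD code;
a real daemon would use quorum_fd_get() and poll the fd from its own
main loop instead of blocking in quorum_dispatch():

    #include <stdint.h>
    #include <stdio.h>
    #include <corosync/corotypes.h>
    #include <corosync/quorum.h>

    static uint32_t cached_quorate;

    /* Invoked by libquorum on every quorum/membership change, so the
     * quorate flag can be cached instead of re-queried in a loop. */
    static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                              uint64_t ring_seq,
                              uint32_t view_list_entries,
                              uint32_t *view_list)
    {
        cached_quorate = quorate;
        printf("quorate=%u ring_seq=%llu members=%u\n", quorate,
               (unsigned long long)ring_seq, view_list_entries);
    }

    int main(void)
    {
        quorum_handle_t handle;
        uint32_t quorum_type;
        quorum_callbacks_t callbacks = {
            .quorum_notify_fn = quorum_notify,
        };

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK) {
            fprintf(stderr, "quorum_initialize failed\n");
            return 1;
        }
        /* CS_TRACK_CURRENT delivers one immediate callback;
         * CS_TRACK_CHANGES delivers one per membership change. */
        if (quorum_trackstart(handle,
                              CS_TRACK_CURRENT | CS_TRACK_CHANGES) != CS_OK) {
            fprintf(stderr, "quorum_trackstart failed\n");
            return 1;
        }
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);
        quorum_finalize(handle);
        return 0;
    }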