>>> Edwin Török <edvin.to...@citrix.com> wrote on 20.02.2019 at 12:30 in
message <0a49f593-1543-76e4-a8ab-06a48c596...@citrix.com>:
> On 20/02/2019 07:57, Jan Friesse wrote:
>> Edwin,
>>>
>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:
>>>>> On 19/02/2019 16:26, Edwin Török wrote:
>>>>>> On 18/02/2019 18:27, Edwin Török wrote:
>>>>>>> Did a test today with CentOS 7.6 with an upstream kernel,
>>>>>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD and our
>>>>>>> patched [1] SBD), and was not able to reproduce the issue yet.
>>>>>> I was finally able to reproduce this using only upstream components
>>>>>> (it seems to be easier to reproduce with our patched SBD, but I was
>>>>>> able to reproduce it using only upstream packages, unpatched by us):
>>>>
>>>> Just out of curiosity: what did you patch in SBD?
>>>> Sorry if I missed the answer in the previous communication.
>>>
>>> It is mostly this PR, which calls getquorate quite often (a more
>>> efficient implementation would be to use the quorum notification API,
>>> as dlm/pacemaker do, although see the concerns in
>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
>>> https://github.com/ClusterLabs/sbd/pull/27
>>>
>>> We have also added our own servant for watching the health of our
>>> control plane, but that is not relevant to this bug (it reproduces
>>> with that watcher turned off too).
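[Side note on the getquorate polling mentioned above: an event-driven
version might look roughly like the sketch below. This is an untested
illustration assuming the corosync 2.x libquorum API; in sbd the
dispatch fd would presumably be wired into its existing servant loop
rather than dispatched in a blocking loop as done here.]

    /* Sketch: track quorum via notifications instead of polling
     * quorum_getquorate(). Build roughly: gcc qtrack.c -lquorum */
    #include <stdio.h>
    #include <stdint.h>
    #include <corosync/corotypes.h>
    #include <corosync/quorum.h>

    static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                              uint64_t ring_seq, uint32_t view_list_entries,
                              uint32_t *view_list)
    {
        /* Invoked only when membership/quorum actually changes. */
        printf("quorate=%u ring_seq=%llu members=%u\n", quorate,
               (unsigned long long)ring_seq, view_list_entries);
    }

    int main(void)
    {
        quorum_handle_t handle;
        quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };
        uint32_t quorum_type;

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
            return 1;
        /* CS_TRACK_CHANGES: request a callback on every quorum change. */
        if (quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
            return 1;
        /* Runs callbacks as events arrive; no periodic getquorate calls. */
        while (quorum_dispatch(handle, CS_DISPATCH_BLOCKING) == CS_OK)
            ;
        quorum_finalize(handle);
        return 0;
    }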
>>>>
>>>>> I was also able to get a corosync blackbox from one of the stuck VMs
>>>>> that showed something interesting:
>>>>> https://clbin.com/d76Ha
>>>>>
>>>>> It is looping on:
>>>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>>>> (non-critical): Resource temporarily unavailable (11)
>>>>
>>>> Hmm ... something like the tx queue of the device being full, or no
>>>> buffers being available anymore while the kernel thread doing the
>>>> cleanup isn't scheduled ...
>>>
>>> Yes, that is very plausible. Perhaps it would be nicer if corosync
>>> went back to the epoll_wait loop when it gets too many EAGAINs from
>>> sendmsg.
>>
>> But this is exactly what happens. Corosync calls sendmsg for all
>> active udpu members and returns back to the main loop -> epoll_wait.
>>
>>> (although this seems different from the original bug, where it got
>>> stuck in epoll_wait)
>>
>> I'm pretty sure it is.
>>
>> Anyway, let's try the "sched_yield" idea. Could you please try the
>> included patch and see if it makes any difference (only for udpu)?
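[To make the pattern under discussion concrete: "back off on EAGAIN and
let epoll signal writability" usually looks like the generic POSIX
sketch below. This is purely an illustration, not corosync's actual
code; the helper name send_or_defer is made up, and it assumes the
socket was already registered with EPOLL_CTL_ADD. As Honza notes,
corosync already returns to its main loop after the send pass, so this
only restates the idea; it doesn't explain the spin.]

    #include <errno.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>

    /* Try a non-blocking send; on EAGAIN, arm EPOLLOUT and go back to
     * the event loop instead of retrying sendmsg() in a tight loop. */
    int send_or_defer(int epfd, int sock, struct msghdr *msg)
    {
        if (sendmsg(sock, msg, MSG_DONTWAIT) >= 0)
            return 0;                              /* sent */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                             /* real error */

        /* Backpressure: have epoll wake us when the socket is writable
         * again (one-shot, so it must be re-armed after each wakeup). */
        struct epoll_event ev = { .events = EPOLLOUT | EPOLLONESHOT,
                                  .data.fd = sock };
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, sock, &ev) < 0)
            return -1;
        return 1;                                  /* retry on EPOLLOUT */
    }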
> Thanks for the patch; unfortunately corosync still spins at 106% even
> with yield:
> https://clbin.com/CF64x
>
> On another host corosync failed to start up completely ("Denied
> connection not ready"), and:
> https://clbin.com/Z35Gl
> (I don't think this is related to the patch; it was doing that before,
> when I looked at it this morning, kernel 4.20.0 this time)

I wonder: is it possible to run "iftop" and "top" (with a suitably
short refresh interval, showing all threads and CPUs) while waiting for
the problem to occur? If I understand correctly, all the other
terminals should freeze, so you'll have plenty of time to snapshot the
output ;-) I expect that the network load will be close to 100% on the
interface, or that the CPU handling the traffic is busy running
corosync.

>
> Best regards,
> --Edwin
>
>>
>> Regards,
>>   Honza
>>
>>>
>>>> Does the kernel log anything in that situation?
>>>
>>> Other than the crmd segfault, no.
>>> From previous observations on xenserver, the softirqs were all stuck
>>> on the CPU that corosync hogged at 100% (I'll check this on upstream,
>>> but I'm fairly sure it will be the same). softirqs do not run at
>>> realtime priority (if we raise the priority of ksoftirqd to realtime,
>>> everything gets unstuck), but they seem to be essential for whatever
>>> corosync is stuck waiting on, in this case likely the sending and
>>> receiving of network packets.
>>>
>>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to
>>> see why this was only reproducible on 4.19 so far.
>>>
>>> Best regards,
>>> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org