Edwin Török wrote:
On 20/02/2019 07:57, Jan Friesse wrote:
Edwin,
On 19/02/2019 17:02, Klaus Wenninger wrote:
On 02/19/2019 05:41 PM, Edwin Török wrote:
On 19/02/2019 16:26, Edwin Török wrote:
On 18/02/2019 18:27, Edwin Török wrote:
Did a test today with CentOS 7.6 with an upstream kernel,
4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD and with our
patched [1] SBD), and was not able to reproduce the issue yet.
I was finally able to reproduce this using only upstream components
(it seems easier to reproduce with our patched SBD, but I was able to
reproduce it with upstream packages unpatched by us as well):
Just out of curiosity: What did you patch in SBD?
Sorry if I missed the answer in the previous communication.
It is mostly this PR, which calls getquorate quite often (a more
efficient implementation would use the quorum notification API as
dlm/pacemaker do, although see the concerns in
https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):
https://github.com/ClusterLabs/sbd/pull/27
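For reference, a rough sketch of what the notification-based variant could
look like; this is going from memory of corosync 2.x's libquorum and is not
sbd's actual code, so the exact signatures should be checked against
<corosync/quorum.h>:

    /* Rough sketch only -- notification-based quorum tracking instead of
     * polling getquorate; check <corosync/quorum.h> for exact signatures. */
    #include <stdio.h>
    #include <corosync/corotypes.h>
    #include <corosync/quorum.h>

    static void quorum_notify(quorum_handle_t h, uint32_t quorate,
                              uint64_t ring_seq, uint32_t view_entries,
                              uint32_t *view_list)
    {
        /* Invoked only when the quorum state changes, instead of calling
         * quorum_getquorate() on every servant loop iteration. */
        fprintf(stderr, "quorate=%u ring_seq=%llu\n",
                quorate, (unsigned long long)ring_seq);
    }

    int main(void)
    {
        quorum_handle_t handle;
        uint32_t quorum_type;
        quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify };

        if (quorum_initialize(&handle, &callbacks, &quorum_type) != CS_OK)
            return 1;
        if (quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
            return 1;

        /* In sbd the dispatch fd would be wired into the servant's own
         * poll loop; here we just block and dispatch notifications. */
        quorum_dispatch(handle, CS_DISPATCH_BLOCKING);
        quorum_finalize(handle);
        return 0;
    }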
We have also added our own servant for watching the health of our
control plane, but that is not relevant to this bug (it reproduces with
that watcher turned off too).
I was also able to get a corosync blackbox from one of the stuck VMs
that showed something interesting:
https://clbin.com/d76Ha
It is looping on:
debug Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
(non-critical): Resource temporarily unavailable (11)
Hmm ... something like the tx-queue of the device being full, or no
buffers available anymore and the kernel thread doing the cleanup not
being scheduled ...
Yes that is very plausible. Perhaps it'd be nicer if corosync went back
to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
But this is exactly what happens. Corosync calls sendmsg for all
active udpu members and then returns to the main loop -> epoll_wait.
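As a generic illustration (not corosync's actual totemudpu code), that send
path behaves roughly like this, with EAGAIN treated as non-critical so the
caller simply drops back into its epoll_wait() loop:

    /* Generic illustration only: non-blocking sendmsg() whose EAGAIN is
     * logged as non-critical, after which the main loop returns to
     * epoll_wait(). */
    #include <errno.h>
    #include <stdio.h>
    #include <sys/socket.h>

    static int try_send(int fd, const struct msghdr *msg)
    {
        ssize_t res = sendmsg(fd, msg, MSG_DONTWAIT | MSG_NOSIGNAL);

        if (res < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Socket buffers full: log it and drop the frame; the totem
             * protocol retransmits, the caller goes back to epoll_wait. */
            fprintf(stderr, "sendmsg failed (non-critical): %d\n", errno);
            return 0;
        }
        return res < 0 ? -1 : 0;
    }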
(although this seems different from the original bug where it got stuck
in epoll_wait)
I'm pretty sure it is.
Anyway, let's try the "sched_yield" idea. Could you please try the included
patch and see if it makes any difference (only for udpu)?
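(The included patch is not reproduced here; purely as a guess at the shape
of the idea, the yield would sit somewhere like this in the udpu send path:)

    /* Guess at the idea only, not the actual patch: after EAGAIN from the
     * udpu sendmsg, yield the CPU so other runnable tasks get a chance to
     * drain the queue instead of corosync spinning on it. */
    #include <errno.h>
    #include <sched.h>
    #include <sys/socket.h>

    static void udpu_send_with_yield(int fd, const struct msghdr *msg)
    {
        if (sendmsg(fd, msg, MSG_DONTWAIT | MSG_NOSIGNAL) < 0 &&
            (errno == EAGAIN || errno == EWOULDBLOCK)) {
            sched_yield();  /* give up the CPU instead of spinning */
        }
    }

Note that under SCHED_RR, sched_yield() only yields to other runnable
realtime tasks of the same priority, so by itself it would not let a
non-realtime ksoftirqd run.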
Thanks for the patch; unfortunately corosync still spins at 106% even with
the yield:
https://clbin.com/CF64x
Yep, it was kind of expected, but at least worth a try. How does strace
look when this happens?
Also, Klaus had the idea to take sbd out of the picture and try a
different RR process to find out what happens, and I think that is again
worth a try.
Could you please install/enable/start
https://github.com/jfriesse/spausedd (packages built by copr are at
https://copr.fedorainfracloud.org/coprs/honzaf/spausedd/),
disable/remove sbd, and run your test?
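(For context, the idea behind spausedd is to detect scheduling pauses; a
minimal sketch of that idea, not spausedd's actual code, looks roughly like
this:)

    /* Minimal sketch of a scheduling-pause detector: run under an RT
     * policy, sleep a fixed interval, and warn if we were not scheduled
     * for much longer than that interval. */
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 99 };
        const long interval_ms = 200, slack_ms = 100;
        struct timespec before, after;

        if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
            perror("sched_setscheduler");   /* needs root / CAP_SYS_NICE */

        for (;;) {
            clock_gettime(CLOCK_MONOTONIC, &before);
            usleep(interval_ms * 1000);
            clock_gettime(CLOCK_MONOTONIC, &after);

            long elapsed = (after.tv_sec - before.tv_sec) * 1000 +
                           (after.tv_nsec - before.tv_nsec) / 1000000;
            if (elapsed > interval_ms + slack_ms)
                fprintf(stderr, "not scheduled for %ld ms\n", elapsed);
        }
    }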
On another host corosync failed to start up completely ("Denied
connection not ready"), and:
https://clbin.com/Z35Gl
(I don't think this is related to the patch; it was already doing that
when I looked at it this morning, on kernel 4.20.0 this time)
This one looks kind of normal and I'm pretty sure it's unrelated (I've
seen it before; sadly I was never able to find a "reliable" reproducer).
Regards,
Honza
Best regards,
--Edwin
Regards,
Honza
Does the kernel log anything in that situation?
Other than the crmd segfault, no.
From previous observations on XenServer, the softirqs were all stuck on
the CPU that corosync hogged at 100% (I'll check this on upstream, but I'm
fairly sure it will be the same). softirqs do not run at realtime priority
(if we increase the priority of ksoftirqd to realtime then it all gets
unstuck), yet they seem to be essential for whatever corosync is stuck
waiting on, in this case likely the sending/receiving of network packets.
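(For reference, a sketch of that workaround, i.e. the programmatic
equivalent of chrt'ing a ksoftirqd/N thread to a realtime policy; the pid
has to be looked up by hand, e.g. from ps:)

    /* Sketch of the workaround: bump a ksoftirqd thread to a realtime
     * policy, roughly equivalent to "chrt -f -p 1 <pid-of-ksoftirqd/N>". */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid-of-ksoftirqd/N>\n", argv[0]);
            return 1;
        }

        pid_t pid = (pid_t)atoi(argv[1]);
        struct sched_param sp = { .sched_priority = 1 };

        if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("pid %ld is now SCHED_FIFO prio 1\n", (long)pid);
        return 0;
    }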
I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
why this was only reproducible on 4.19 so far.
Best regards,
--Edwin
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org