[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Ulrich Windl
>>> Edwin Török  wrote on 20.02.2019 at 12:30 in
message <0a49f593-1543-76e4-a8ab-06a48c596...@citrix.com>:
> On 20/02/2019 07:57, Jan Friesse wrote:
>> Edwin,
>>>
>>>
>>> On 19/02/2019 17:02, Klaus Wenninger wrote:
 On 02/19/2019 05:41 PM, Edwin Török wrote:
> On 19/02/2019 16:26, Edwin Török wrote:
>> On 18/02/2019 18:27, Edwin Török wrote:
>>> Did a test today with CentOS 7.6 with upstream kernel and with
>>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
>>> patched [1] SBD) and was not able to reproduce the issue yet.
>> I was finally able to reproduce this using only upstream components
>> (it seems easier to reproduce with our patched SBD, but I could also
>> reproduce it using only upstream packages unpatched by us):

 Just out of curiosity: What did you patch in SBD?
 Sorry if I missed the answer in the previous communication.
>>>
>>> It is mostly this PR, which calls getquorate quite often (a more
>>> efficient impl. would be to use the quorum notification API like
>>> dlm/pacemaker do, although see concerns in
>>> https://lists.clusterlabs.org/pipermail/users/2019-February/016249.html):

>>> https://github.com/ClusterLabs/sbd/pull/27 
>>>
>>> We have also added our own servant for watching the health of our
>>> control plane, but that is not relevant to this bug (it reproduces with
>>> that watcher turned off too).
>>>
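
As an aside, the notification-based approach mentioned above would look roughly
like the sketch below (prototypes recalled from corosync/quorum.h; verify them
against the installed headers before relying on this, and link with -lquorum):

/* Minimal libquorum tracking sketch: get a callback on quorum changes
 * instead of polling quorum_getquorate() from a servant. */
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/quorum.h>

static void quorum_notify(quorum_handle_t handle, uint32_t quorate,
                          uint64_t ring_seq, uint32_t n_members,
                          uint32_t *members)
{
    (void)handle; (void)members;
    printf("quorate=%u ring_seq=%llu members=%u\n",
           quorate, (unsigned long long)ring_seq, n_members);
}

int main(void)
{
    quorum_callbacks_t cb = { .quorum_notify_fn = quorum_notify };
    quorum_handle_t handle;
    uint32_t quorum_type;

    if (quorum_initialize(&handle, &cb, &quorum_type) != CS_OK)
        return 1;
    if (quorum_trackstart(handle, CS_TRACK_CHANGES) != CS_OK)
        return 1;

    /* Block and deliver notifications; a real daemon would instead add the
     * descriptor from quorum_fd_get() to its own poll loop. */
    quorum_dispatch(handle, CS_DISPATCH_BLOCKING);

    quorum_trackstop(handle);
    quorum_finalize(handle);
    return 0;
}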

> I was also able to get a corosync blackbox from one of the stuck VMs
> that showed something interesting:
> https://clbin.com/d76Ha 
>
> It is looping on:
> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
> (non-critical): Resource temporarily unavailable (11)

 Hmm ... something like tx-queue of the device full, or no buffers
 available anymore and kernel-thread doing the cleanup isn't
 scheduled ...
>>>
>>> Yes that is very plausible. Perhaps it'd be nicer if corosync went back
>>> to the epoll_wait loop when it gets too many EAGAINs from sendmsg.
>> 
>> But this is exactly what happens. Corosync calls sendmsg for all
>> active udpu members and then returns to the main loop -> epoll_wait.
>> 
>>> (although this seems different from the original bug where it got stuck
>>> in epoll_wait)
>> 
>> I'm pretty sure it is.
>> 
>> Anyway, let's try "sched_yield" idea. Could you please try included
>> patch and see if it makes any difference (only for udpu)?
> 
> Thanks for the patch; unfortunately corosync still spins at 106% even with
> yield:
> https://clbin.com/CF64x 
> 
> On another host corosync failed to start up completely (Denied
> connection not ready), and:
> https://clbin.com/Z35Gl 
> (I don't think this is related to the patch, it was doing that before
> when I looked at it this morning, kernel 4.20.0 this time)

I wonder: is it possible to run "iftop" and "top" (with a suitably fast update
interval, showing all threads and CPUs) while waiting for the problem to occur?
If I understand it correctly, all those other terminals should freeze, so you'll
have plenty of time to snapshot the output ;-) I expect that your network load
will be close to 100% on the interface, or that the CPU handling the traffic is
busy running corosync.

> 
> Best regards,
> --Edwin
> 
>> 
>> Regards,
>>   Honza
>> 
>>>
 Does the kernel log anything in that situation?
>>>
>>> Other than the crmd segfault no.
>>> From previous observations on xenserver the softirqs were all stuck on
>>> the CPU that corosync hogged 100% (I'll check this on upstream, but I'm
>>> fairly sure it'll be the same). softirqs do not run at realtime priority
>>> (if we increase the priority of ksoftirqd to realtime then it all gets
>>> unstuck), but seem to be essential for whatever corosync is stuck
>>> waiting on, in this case likely the sending/receiving of network packets.
>>>
>>> I'm trying to narrow down the kernel between 4.19.16 and 4.20.10 to see
>>> why this was only reproducible on 4.19 so far.
>>>
>>> Best regards,
>>> --Edwin




Re: [ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 08:07 AM, Ulrich Windl wrote:
Klaus Wenninger  wrote on 19.02.2019 at 18:02 in
> message <7b626ca1-4f59-6257-bfb5-ef5d0d823...@redhat.com>:
> [...]
>>> It is looping on:
>>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>>> (non-critical): Resource temporarily unavailable (11)
> I wonder whether this is the reason for looping or the consequence of 
> loop-sending. To me it looks like a good idea to try sched_yield() in this 
> situation. Maybe then the other tasks have a chance to empty the send queue.

Doesn't that just trigger round-robin within SCHED_RR? So if the other threads
aren't SCHED_RR at the same prio, would it help?

>
>> Hmm ... something like tx-queue of the device full, or no buffers
>> available anymore and kernel-thread doing the cleanup isn't
>> scheduled ...
>> Does the kernel log anything in that situation?
>>
>>> Also noticed this:
>>> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
>>> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
>>> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
>>> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
>>> 19  0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0
> Maybe time to enable core dumps...
>
> [...]
>
> Regards,
> Ulrich Windl
>
>




[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Ulrich Windl
>>> Klaus Wenninger  wrote on 19.02.2019 at 18:02 in
message <7b626ca1-4f59-6257-bfb5-ef5d0d823...@redhat.com>:
[...]
>>
>> It is looping on:
>> debug   Feb 19 16:37:24 mcast_sendmsg(408):12: sendmsg(mcast) failed
>> (non-critical): Resource temporarily unavailable (11)

I wonder whether this is the reason for looping or the consequence of 
loop-sending. To me it looks like a good idea to try sched_yield() in this 
situation. Maybe then the other tasks have a chance to empty the send queue.
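
Something like the sketch below is what I have in mind; a hypothetical retry
helper only, not the actual totemudp/totemudpu send path:

/* Hypothetical helper illustrating the sched_yield()-on-EAGAIN idea. */
#include <errno.h>
#include <sched.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t sendmsg_with_yield(int fd, const struct msghdr *msg, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        ssize_t res = sendmsg(fd, msg, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (res >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return res;
        /* Queue full: yield so other runnable tasks (e.g. a boosted
         * ksoftirqd) can drain it before we retry.  Note that under
         * SCHED_RR this only yields to tasks of equal or higher priority. */
        sched_yield();
    }
    errno = EAGAIN;
    return -1;  /* still failing; drop back to the epoll_wait loop */
}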

> 
> Hmm ... something like tx-queue of the device full, or no buffers
> available anymore and kernel-thread doing the cleanup isn't
> scheduled ...
> Does the kernel log anything in that situation?
> 
>>
>> Also noticed this:
>> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
>> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
>> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
>> c3 0f 1f 80 00 00 00 00 48 31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77
>> 19  0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 75 7a 48 89 f8 48 83 e0

Maybe time to enable core dumps...

[...]

Regards,
Ulrich Windl




[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Ulrich Windl
>>> Jan Pokorný  wrote on 18.02.2019 at 21:08 in
message <20190218200816.gd23...@redhat.com>:
> On 15/02/19 08:48 +0100, Jan Friesse wrote:
>> Ulrich Windl wrote:
>>> IMHO any process running at real-time priorities must make sure
>>> that it consumes the CPU only for the short moments that really are
>>> critical to be performed in time.
> 
> Pardon me, Ulrich, but something is off about this, especially
> if meant in general.
> 
> Even if the infrastructure of the OS was entirely happy with
> switching scheduling parameters constantly and in the furious rate
> (I assume there may be quite a penalty when doing so, in the overhead
> caused with reconfiguration of the schedulers involved), the
> time-to-proceed-critical sections do not appear to be easily divisible
> (if at all) in the overall code flow of a singlethreaded program like
> corosync, since everything is time-critical in a sense (token and
> other timeouts are ticking), and offloading some side/non-critical
> tasks for asynchronous processing is likely not on the roadmap for
> corosync, given the historical move from multithreading (only retained
> for logging for which an extra precaution is needed so as to prevent
> priority inversion, which will generally always be a threat when
> unequal priority processes do interface, even if transitively).

That's what I'm talking about: if you let users pick the priority, every user
will pick the highest possible priority, because they expect to get better
service. In fact they don't, and with real-time scheduling it can even halt the
system. For corosync this means that RT priority should only be used where it's
actually needed, not just to cope with poor performance.

> 
> The step around multithreading is to have another worker process
> with IPC of some sorts, but with that, you only add more overhead
> and complexity around such additionally managed queues into the
> game (+ possibly priority inversion yet again).

Yes, a real-time system has to be designed with real-time in mind all the
time; you can't make any system a realtime system just by using real-time
scheduling priorities.

> 
> BTW. regarding "must make sure" part, barring self-supervision
> of any sort (new complexity + overhead), that's a problem of
> fixed-priority scheduling assignment.  I've been recently raising
> an awareness of (Linux-specific) *deadline scheduler* [1,2], which:
> 
> - has even higher hierarchical priority compared to SCHED_RR
>   policy (making the latter possibly ineffective, which would
>   not be very desirable, I guess)
> 
> - may better express not only the actual requirements, but some
>   "under normal circumstances, using reasonably scoped HW for
>   the task" (speaking of hypothetical defaults now, possibly
>   user configurable and/or influenced with actual configured
>   timeouts at corosync level) upper boundary for how much of
>   CPU run-time shall be allowed for the process in absolute
>   terms, possibly preventing said livelock scenarios (being
>   throttled when exceeded, presumably speeding the loss of
>   the token and subsequent fencing up)
> 
> Note that in systemd deployments, it would be customary for
> the service launcher (unit file executor) to actually expose
> this stuff as yet another user-customizable wrapping around
> the actual run, but support for this very scheduling policy
> is currently missing[3].
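
For reference, since SCHED_DEADLINE came up: glibc provides no wrapper for it,
so a process has to call sched_setattr(2) through syscall() directly
(SYS_sched_setattr needs reasonably recent kernel headers). A minimal sketch;
the runtime/deadline/period values are made up purely for illustration:

/* Minimal SCHED_DEADLINE sketch via the raw syscall. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;      /* for SCHED_OTHER/BATCH */
    uint32_t sched_priority;  /* for SCHED_FIFO/RR */
    uint64_t sched_runtime;   /* ns of CPU time guaranteed per period */
    uint64_t sched_deadline;  /* ns */
    uint64_t sched_period;    /* ns */
};

int main(void)
{
    struct sched_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  30 * 1000 * 1000;  /* 30 ms CPU budget ...  */
    attr.sched_deadline = 100 * 1000 * 1000;  /* ... due within 100 ms */
    attr.sched_period   = 100 * 1000 * 1000;  /* ... every 100 ms      */

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr");  /* needs CAP_SYS_NICE / root */
        return 1;
    }
    /* Exceeding the 30 ms budget gets the task throttled until the next
     * period: an absolute upper bound, unlike a plain SCHED_RR priority. */
    pause();
    return 0;
}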
> 
>>> Specifically having some code that performs poorly (for various
>>> reasons) is absolutely _not_ a candidate to be run with real-time
>>> priorities to fix the bad performance!
> 
> You've managed to flip (more-or-less, I have no contrary evidence) an
> isolated occurrence of evidently buggy behaviour into a generalized
> description of the performance of the involved pieces of SW.

Actually (independent of this issue) I always had the impression that corosync
is communicating too much (a lot of traffic while nothing is happening in the
cluster), and that it easily breaks under load. And I had the impression that
the developers tried to fix this by adding real-time priorities to the parts
that expose the problem. Which is the wrong type of fix IMHO...

> If that was that bad, we would hear there's not enough room for
> the actual clustered resources all the time, but I am not aware
> of that.

Depends on what "room" actually refers to: would corosync ever work reasonably
on a single-CPU system? Yes, that's purely hypothetical, but there actually
exists software that deadlocks with only one CPU...

> 
> With buggy behaviour, I mean, logs from https://clbin.com/9kOUM 
> and https://github.com/ClusterLabs/libqb/commit/2a06ffecd bug fix
> from the past seem to have something in common, like high load
> as a surrounding circumstance, and the missed event/job (on,
> presumably a socket, fd=15 in the log ... since that never gets
> handled even when there's no other input event).  Guess that
> another look is needed at _poll_and_add_to_jobs_ function
> (not sure why it's without leading/trai

[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-17 Thread Ulrich Windl
Hi!

I also wonder: with SCHED_RR, would a sched_yield() at the proper place in the
100% CPU loop also fix this issue? Or do you think "we need real-time, and
cannot allow any other task to run"?

Regards,
Ulrich

>>> Edwin Török  wrote on 15.02.2019 at 17:58 in
message :
> On 15/02/2019 16:08, Christine Caulfield wrote:
>> On 15/02/2019 13:06, Edwin Török wrote:
>>> I tried again with 'debug: trace', lots of process pause here:
>>> https://clbin.com/ZUHpd 
>>>
>>> And here is an strace running realtime prio 99, a LOT of epoll_wait and
>>> sendmsg (gz format):
>>> https://clbin.com/JINiV 
>>>
>>> It detects large numbers of members left, but I think this is because
>>> the corosync on those hosts got similarly stuck:
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] A new membership
>>> (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7
>>> 1 10
>>> Feb 15 12:51:07 localhost corosync[29278]:  [TOTEM ] Failed to receive
>>> the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10
>>>
>>> Looking at another host where corosync is still stuck at 100%, it says:
>>> https://clbin.com/6UOn6 
>>>
>> 
>> Thanks, that's really quite odd. I have vague recollections of a problem
>> where corosync was spinning on epoll without reading anything but can't
>> find the details at the moment, annoying.
>> 
>> Some things you might be able to try that might help:
>> 
>> 1) is it possible to run without sbd? Sometimes too much polling from
>> clients can cause odd behaviour
>> 2) is it possible to try with a different kernel? We've tried a vanilla
>> 4.19 and it's fine, but not with the Xen patches obviously
> 
> I'll try with some bare-metal upstream distros and report back the repro
> steps if I can get it to reliably repro, hopefully early next week; it
> is unlikely I'll get a working repro today.
> 
>> 3) Does running corosync with the -p option help?
> 
> Yes, with "-p" I was able to run cluster create/GFS2 plug/unplug/destroy
> on 16 physical hosts in a loop for an hour without any crashes (previously
> it would crash within minutes).
> 
> I found another workaround too:
> echo NO_RT_RUNTIME_SHARE >/sys/kernel/debug/sched_features
> 
> This makes the 95% realtime process CPU limit from
> sched_rt_runtime_us/sched_rt_period_us apply per core, instead of
> globally, so there would be 5% time left for non-realtime tasks on each
> core. Seems to be enough to avoid the livelock, I was not able to
> observe corosync using high CPU % anymore.
> Still got more tests to run on this over the weekend, but looks promising.
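
As an aside, the global budget referred to above comes from the standard
/proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us knobs; a minimal
sketch that merely prints the configured RT bandwidth (paths and the 95%
default are the usual Linux ones):

/* Minimal sketch: print the RT throttling budget behind the 95% figure
 * (sched_rt_runtime_us out of sched_rt_period_us). */
#include <stdio.h>

static int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1) ? 0 : -1;
    fclose(f);
    return ok;
}

int main(void)
{
    long runtime, period;

    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0 ||
        period <= 0) {
        fprintf(stderr, "could not read the RT throttling sysctls\n");
        return 1;
    }
    if (runtime == -1)   /* -1 means RT throttling is disabled entirely */
        printf("RT throttling disabled\n");
    else                 /* defaults are 950000/1000000, i.e. 95% */
        printf("RT tasks limited to %.1f%% of CPU time per period\n",
               100.0 * (double)runtime / (double)period);
    return 0;
}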
> 
> This is a safety layer of course, to prevent the system from fencing if
> we encounter high CPU usage in corosync/libqb. I am still interested in
> tracking down the corosync/libqb issue as it shouldn't have happened in
> the first place.
> 
>> 
>> Is there any situation where this has worked? either with different
>> components or different corosync.conf files?
>> 
>> Also, and I don't think this is directly related to the issue, but I can
>> see configuration reloads happening from 2 nodes every 5 seconds. It's
>> very odd and maybe not what you want!
> 
> The configuration reloads are a way of triggering this bug reliably, I
> should've mentioned that earlier
> (the problem happens during a configuration reload, but not always, and
> by doing configuration reloads in a loop that just adds/removes one node
> the problem can be triggered reliably within minutes).
> 
> 
> Best regards,
> --Edwin





[ClusterLabs] Antw: Re: corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Ulrich Windl
I would expect that, as strace interrupts the RT task to query it. You
should run strace at the same RT priority ;-)

>>> Edvin Torok  14.02.19 19:54 >>>
Apologies for top posting; the strace you asked for is available here (although
running strace itself had the side-effect of getting corosync out of the livelock):
https://clbin.com/9kOUM

Best regards,
--Edwin

From: Jan Friesse 
Sent: 14 February 2019 18:34
To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin 
Torok
Cc: Mark Syms
Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with 
kernel 4.19, priority inversion/livelock?

Edwin,

> Hello,
>
> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
> noticed a fundamental problem with realtime priorities:
> - corosync runs on CPU3, and interrupts for the NIC used by corosync are
> also routed to CPU3
> - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
> without it packets sent/received from that interface would not get processed
> - corosync is in a busy loop using 100% CPU, never giving a chance for
> softirqs to be processed (including TIMER and SCHED)
>
> This appears to be a priority inversion problem, if corosync runs as
> realtime then everything it needs (timers...) should be realtime as
> well, otherwise running as realtime guarantees we'll miss the watchdog
> deadline, instead of guaranteeing that we process the data before the
> deadline.
>
> Do you have some advice on what the expected realtime priorities would
> be for:
> - corosync
> - sbd
> - hard irqs
> - soft irqs
>
> Also would it be possible for corosync to avoid hogging the CPU in libqb?
> (Our hypothesis is that if softirqs are not processed then timers
> wouldn't work for processes on that CPU either)
>
> Some more background and analysis:
>
> We noticed that cluster membership changes were very unstable
> (especially as you approach 16 physical host clusters) causing the whole
> cluster to fence when adding a single node. This was previously working
> fairly reliably with a 4.4 based kernel.
>
> I've increased SBD timeout to 3600 to be able to investigate the problem
> and noticed that corosync was using 100% of CPU3 [1] and I immediately
> lost SSH access on eth0 (corosync was using eth1), where eth0's
> interrupts were also processed on CPU3 (and irqbalance didn't move it).
>
> IIUC SCHED_RR tasks should not be able to take up 100% of CPU, according
> to [3] it shouldn't be allowed to use more than 95% of CPU.
>
> Softirqs were not processed at all on CPU3 (see [2], the numbers in the
> CPU3 column did not change, the numbers in the other columns did).
> Tried decreasing priority of corosync using chrt to 1, which didn't
> help. I then increased the priority of ksoftirqd to 50 using chrt, which
> immediately solved the CPU usage problem on corosync.
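
(For reference, the chrt workaround above corresponds to roughly the sketch
below. The report does not say which policy chrt applied, so SCHED_FIFO and
priority 50 here are assumptions taken from the experiment, not a
recommendation; the ksoftirqd PID is passed in as a placeholder argument.)

/* Sketch: bump one ksoftirqd thread to a real-time priority so it can still
 * run next to a SCHED_RR corosync (equivalent of: chrt -f -p 50 <pid>). */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ksoftirqd-pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(pid, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");  /* needs CAP_SYS_NICE / root */
        return 1;
    }
    return 0;
}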
>
> I tried a simple infinite loop program with realtime priority, but it
> didn't reproduce the problems with interrupts getting stuck.
>
>
> Three problems here:
> * all softirqs were stuck (not being processed) on CPU3, which included
> TIMER and SCHED. corosync relies quite heavily on timers, would lack of
> processing them cause the 100% CPU usage?
> * is there a kernel bug introduced between 4.4 - 4.19 that causes
> realtime tasks to not respect the 95% limit anymore? This would leave 5%
> time for IRQs, including NIC IRQs
> *  if corosync runs at higher priority than the kernel softirq thread
> processing NIC IRQ how is corosync expecting incoming packets to be
> processed, if it is hogging the CPU by receiving, polling and sending
> packets?
>
> On another host which exhibited the same problem I've run strace (which
> also had the side-effect of getting corosync unstuck from 100% CPU use
> after strace finished):
> 1 bind
> 5 close
> 688 epoll_wait
> 8 futex
> 1 getsockname
> 3 ioctl
> 1 open
> 3 recvfrom
> 190 recvmsg
> 87 sendmsg
> 9 sendto
> 4 socket
> 6 write
>

don't you have an strace by any chance? Corosync itself shouldn't really
get stuck for that long, especially if softirq was stuck so it couldn't get
any packets, right? So it may be (take it as a shot in the dark) that some
error code is not processed and instead there is some kind of never-ending
loop.

Honza

> On yet another host I've run gdb while corosync was stuck:
> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
> #0 0x7f6fd34a0afb in do_futex_wait.constprop.1 ()
> from /lib64/libpthread.so.0
> #1 0x7f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
> from /lib64/libpthread.so.0
> #2 0x7f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 () from
> /lib64/libpthread.so.0
> #3 0x7f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
> #4 0x7f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
> #5 0x7f6fd31c4bad in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
> #0 0x7f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
> #1 0x0