Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-26 Thread Edwin Török
On 26/02/2019 07:32, Jan Friesse wrote: > Edwin >> Török wrote: >>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6 >>> (Xen 4.7.6) >> >> 2 vCPUs makes this a lot easier to reproduce the lost network >> connectivity/fencing. >> 1 vCPU reproduces just the high CPU usage, but

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Jan Friesse
Edwin, > Hi, I've done some more tests and I am now able to reproduce the 100% CPU usage / infinite loop of corosync/libqb on all the kernels that I tested. Therefore I think this is NOT a regression in 4.19 kernel. That's good. > A suitable workaround for 4.19+ kernels is to run this on startup

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Klaus Wenninger
On 02/25/2019 03:10 PM, Edwin Török wrote: > Hi, > > I've done some more tests and I am now able to reproduce the 100% CPU > usage / infinite loop of corosync/libqb on all the kernels that I > tested. Therefore I think this is NOT a regression in 4.19 kernel. > > A suitable workaround for 4.19+

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Edwin Török
Hi, I've done some more tests and I am now able to reproduce the 100% CPU usage / infinite loop of corosync/libqb on all the kernels that I tested. Therefore I think this is NOT a regression in 4.19 kernel. A suitable workaround for 4.19+ kernels is to run this on startup (e.g. from tmpfiles.d):

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Edwin Török
On 20/02/2019 23:47, Jan Pokorný wrote: > On 20/02/19 21:16 +0100, Klaus Wenninger wrote: >> On 02/20/2019 08:51 PM, Jan Pokorný wrote: >>> On 20/02/19 17:37 +, Edwin Török wrote: strace for the situation described below (corosync 95%, 1 vCPU): https://clbin.com/hZL5z >>> I might

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-21 Thread Jan Friesse
Edwin, On 20/02/2019 13:08, Jan Friesse wrote: Edwin Török wrote: On 20/02/2019 07:57, Jan Friesse wrote: Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote:

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:16 +0100, Klaus Wenninger wrote: > On 02/20/2019 08:51 PM, Jan Pokorný wrote: >> On 20/02/19 17:37 +, Edwin Török wrote: >>> strace for the situation described below (corosync 95%, 1 vCPU): >>> https://clbin.com/hZL5z >> I might have missed that earlier or this may be just some

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 21:25 +0100, Klaus Wenninger wrote: > Hmm maybe the thing that should be scheduled is running at > SCHED_RR as well but with just a lower prio. So it wouldn't > profit from the sched_yield and it wouldn't get anything of > the 5% either. Actually, it would possibly make the situation

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 06:37 PM, Edwin Török wrote: > On 20/02/2019 13:08, Jan Friesse wrote: >> Edwin Török wrote: >>> On 20/02/2019 07:57, Jan Friesse wrote: Edwin, > > On 19/02/2019 17:02, Klaus Wenninger wrote: >> On 02/19/2019 05:41 PM, Edwin Török wrote: >>> On 19/02/2019

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Klaus Wenninger
On 02/20/2019 08:51 PM, Jan Pokorný wrote: > On 20/02/19 17:37 +, Edwin Török wrote: >> strace for the situation described below (corosync 95%, 1 vCPU): >> https://clbin.com/hZL5z > I might have missed that earlier or this may be just some sort > of insignificant/misleading clue: > >> strace:

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 20/02/19 17:37 +, Edwin Török wrote: > strace for the situation described below (corosync 95%, 1 vCPU): > https://clbin.com/hZL5z I might have missed that earlier or this may be just some sort of insignificant/misleading clue: > strace: Process 4923 attached with 2 threads > strace: [

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 13:08, Jan Friesse wrote: > Edwin Török wrote: >> On 20/02/2019 07:57, Jan Friesse wrote: >>> Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: > On 02/19/2019 05:41 PM, Edwin Török wrote: >> On 19/02/2019 16:26, Edwin Török wrote: >>> On 18/02/2019

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Ken Gaillot
On Wed, 2019-02-20 at 14:03 +, Edwin Török wrote: > > On 20/02/2019 12:44, Jan Pokorný wrote: > > On 19/02/19 16:41 +, Edwin Török wrote: > > > Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip > > > 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in > > >

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 12:44, Jan Pokorný wrote: > On 19/02/19 16:41 +, Edwin Török wrote: >> Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip >> 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in >> libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00 >> 00 04 00 00 00 74 07

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Friesse
Edwin Török wrote: On 20/02/2019 07:57, Jan Friesse wrote: Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote: Did a test today with CentOS 7.6 with upstream

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Jan Pokorný
On 19/02/19 16:41 +, Edwin Török wrote: > Also noticed this: > [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp > 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000] > [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00 > c3 0f 1f 80 00 00 00 00 48

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 07:57, Jan Friesse wrote: > Edwin, >> >> >> On 19/02/2019 17:02, Klaus Wenninger wrote: >>> On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: > On 18/02/2019 18:27, Edwin Török wrote: >> Did a test today with CentOS 7.6 with upstream

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Jan Friesse
Edwin, On 19/02/2019 17:02, Klaus Wenninger wrote: On 02/19/2019 05:41 PM, Edwin Török wrote: On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote: Did a test today with CentOS 7.6 with upstream kernel and with 4.20.10-1.el7.elrepo.x86_64 (tested both with

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Klaus Wenninger
On 02/19/2019 06:21 PM, Edwin Török wrote: > > On 19/02/2019 17:02, Klaus Wenninger wrote: >> On 02/19/2019 05:41 PM, Edwin Török wrote: >>> On 19/02/2019 16:26, Edwin Török wrote: On 18/02/2019 18:27, Edwin Török wrote: > Did a test today with CentOS 7.6 with upstream kernel and with

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 19/02/2019 17:02, Klaus Wenninger wrote: > On 02/19/2019 05:41 PM, Edwin Török wrote: >> On 19/02/2019 16:26, Edwin Török wrote: >>> On 18/02/2019 18:27, Edwin Török wrote: Did a test today with CentOS 7.6 with upstream kernel and with 4.20.10-1.el7.elrepo.x86_64 (tested both with

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Klaus Wenninger
On 02/19/2019 05:41 PM, Edwin Török wrote: > On 19/02/2019 16:26, Edwin Török wrote: >> On 18/02/2019 18:27, Edwin Török wrote: >>> Did a test today with CentOS 7.6 with upstream kernel and with >>> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our >>> patched [1] SBD) and was

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 19/02/2019 16:26, Edwin Török wrote: > On 18/02/2019 18:27, Edwin Török wrote: >> Did a test today with CentOS 7.6 with upstream kernel and with >> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our >> patched [1] SBD) and was not able to reproduce the issue yet. > > I was

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 18/02/2019 18:27, Edwin Török wrote: > Did a test today with CentOS 7.6 with upstream kernel and with > 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our > patched [1] SBD) and was not able to reproduce the issue yet. I was able to finally reproduce this using only upstream

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Jan Pokorný
On 15/02/19 08:48 +0100, Jan Friesse wrote: > Ulrich Windl wrote: >> IMHO any process running at real-time priorities must make sure >> that it consumes the CPU only for short moments that are really >> critical to be performed in time. Pardon me, Ulrich, but something is off about this,

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Edwin Török
On 18/02/2019 15:49, Klaus Wenninger wrote: > On 02/18/2019 04:15 PM, Christine Caulfield wrote: >> On 15/02/2019 16:58, Edwin Török wrote: >>> On 15/02/2019 16:08, Christine Caulfield wrote: On 15/02/2019 13:06, Edwin Török wrote: > I tried again with 'debug: trace', lots of process

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Klaus Wenninger
On 02/18/2019 04:15 PM, Christine Caulfield wrote: > On 15/02/2019 16:58, Edwin Török wrote: >> On 15/02/2019 16:08, Christine Caulfield wrote: >>> On 15/02/2019 13:06, Edwin Török wrote: I tried again with 'debug: trace', lots of process pause here: https://clbin.com/ZUHpd And

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Christine Caulfield
On 15/02/2019 16:58, Edwin Török wrote: > On 15/02/2019 16:08, Christine Caulfield wrote: >> On 15/02/2019 13:06, Edwin Török wrote: >>> I tried again with 'debug: trace', lots of process pause here: >>> https://clbin.com/ZUHpd >>> >>> And here is an strace running realtime prio 99, a LOT of

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 16:08, Christine Caulfield wrote: > On 15/02/2019 13:06, Edwin Török wrote: >> I tried again with 'debug: trace', lots of process pause here: >> https://clbin.com/ZUHpd >> >> And here is an strace running realtime prio 99, a LOT of epoll_wait and >> sendmsg (gz format): >>

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 15/02/2019 13:06, Edwin Török wrote: > > > On 15/02/2019 11:12, Christine Caulfield wrote: >> On 15/02/2019 10:56, Edwin Török wrote: >>> On 15/02/2019 09:31, Christine Caulfield wrote: On 14/02/2019 17:33, Edwin Török wrote: > Hello, > > We were testing corosync 2.4.3/libqb

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 11:12, Christine Caulfield wrote: > On 15/02/2019 10:56, Edwin Török wrote: >> On 15/02/2019 09:31, Christine Caulfield wrote: >>> On 14/02/2019 17:33, Edwin Török wrote: Hello, We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 15/02/2019 10:56, Edwin Török wrote: > On 15/02/2019 09:31, Christine Caulfield wrote: >> On 14/02/2019 17:33, Edwin Török wrote: >>> Hello, >>> >>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and >>> noticed a fundamental problem with realtime priorities: >>> - corosync

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 09:31, Christine Caulfield wrote: > On 14/02/2019 17:33, Edwin Török wrote: >> Hello, >> >> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and >> noticed a fundamental problem with realtime priorities: >> - corosync runs on CPU3, and interrupts for the NIC used

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Christine Caulfield
On 14/02/2019 17:33, Edwin Török wrote: > Hello, > > We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and > noticed a fundamental problem with realtime priorities: > - corosync runs on CPU3, and interrupts for the NIC used by corosync are > also routed to CPU3 > - corosync runs

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Jan Friesse
From: Jan Friesse Sent: 14 February 2019 18:34 To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok Cc: Mark Syms Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock? Edwin, Hello, We were testing

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Edvin Torok
To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok Cc: Mark Syms Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock? Edwin, > Hello, > > We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Jan Friesse
Edwin, Hello, We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a fundamental problem with realtime priorities: - corosync runs on CPU3, and interrupts for the NIC used by corosync are also routed to CPU3 - corosync runs with SCHED_RR, ksoftirqd does not (should

[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Edwin Török
Hello, We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a fundamental problem with realtime priorities: - corosync runs on CPU3, and interrupts for the NIC used by corosync are also routed to CPU3 - corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
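
For anyone following the thread who wants to check the two things it keeps coming back to (which scheduling policy a process runs under, and how much CPU time realtime throttling leaves for everything else), here is a minimal Python sketch. It is not taken from any of the posts. The helper names are made up for illustration, and the /proc/sys/kernel/sched_rt_runtime_us and sched_rt_period_us paths, with their usual defaults of 950000 and 1000000 microseconds, are assumptions about a stock Linux kernel; that default is most likely where the "5%" mentioned in the thread comes from.

```python
import os

# Map whatever policy constants this Python build exposes to readable names.
_POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}
# The kernel may OR this flag into the value returned by sched_getscheduler().
_RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0x40000000)


def policy_name(policy):
    """Return a readable name for a raw scheduling policy value (hypothetical helper)."""
    return _POLICY_NAMES.get(policy & ~_RESET_ON_FORK, "unknown(%d)" % policy)


def non_rt_share(runtime_us, period_us):
    """Fraction of each period that RT throttling leaves to non-realtime tasks."""
    if runtime_us < 0:
        # A runtime of -1 disables throttling, so RT tasks may use all CPU time.
        return 0.0
    return (period_us - runtime_us) / period_us


def read_int(path):
    """Read a single integer from a sysctl-style file."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    if hasattr(os, "sched_getscheduler"):
        # PID 0 means "the calling process"; pass another PID to inspect it instead.
        print("this process:", policy_name(os.sched_getscheduler(0)))
    try:
        runtime = read_int("/proc/sys/kernel/sched_rt_runtime_us")
        period = read_int("/proc/sys/kernel/sched_rt_period_us")
        print("non-RT share: %.1f%%" % (100 * non_rt_share(runtime, period)))
    except (OSError, ValueError):
        print("sched_rt_* sysctls not readable on this system")
```

To look at corosync rather than the current process, pass its PID to os.sched_getscheduler() in place of 0.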