On 26/02/2019 07:32, Jan Friesse wrote:
> Edwin Török wrote:
>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6
>> (Xen 4.7.6)
>
> 2 vCPUs makes this a lot easier to reproduce the lost network
> connectivity/fencing. 1 vCPU reproduces just the high CPU usage, but
Edwin,
> Hi,
> I've done some more tests and I am now able to reproduce the 100% CPU
> usage / infinite loop of corosync/libqb on all the kernels that I
> tested. Therefore I think this is NOT a regression in the 4.19 kernel.
That's good.
> A suitable workaround for 4.19+ kernels is to run this on startup
On 02/25/2019 03:10 PM, Edwin Török wrote:
Hi,
I've done some more tests and I am now able to reproduce the 100% CPU
usage / infinite loop of corosync/libqb on all the kernels that I
tested. Therefore I think this is NOT a regression in the 4.19 kernel.

A suitable workaround for 4.19+ kernels is to run this on startup (e.g.
from tmpfiles.d):
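(The command itself is cut off in the archive and is left elided here.
Purely as a hedged illustration of the mechanism: a tmpfiles.d entry of
type "w" writes a value into a /proc or /sys file at boot, so a
workaround of this shape would look something like the following; the
file name, path, and value are placeholders, not the ones from the
thread.)

  # /etc/tmpfiles.d/corosync-workaround.conf (hypothetical file name)
  # Type  Path                                  Mode UID GID Age Argument
  w       /proc/sys/kernel/sched_rt_runtime_us  -    -   -   -   950000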
On 20/02/19 21:25 +0100, Klaus Wenninger wrote:
> Hmm, maybe the thing that should be scheduled is running at
> SCHED_RR as well, but with just a lower prio. So it wouldn't
> profit from the sched_yield and it wouldn't get anything of
> the 5% either.
Actually, it would possibly make the situation
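(A bit of context, mine rather than the thread's: sched_yield() only
moves the caller to the tail of the run queue for its own priority
level, so a lower-priority SCHED_RR task still cannot run while a
higher-priority one stays runnable; and the "5%" is the CPU share that
the kernel's realtime throttling reserves for non-realtime tasks, which
a SCHED_RR task of any priority is excluded from. Both are easy to
inspect:)

  # Show a task's scheduling policy and realtime priority
  chrt -p $(pidof corosync)
  # Realtime throttling defaults: RT tasks may use 950000us of every
  # 1000000us period, i.e. ~5% of each CPU is kept for non-RT tasks
  sysctl kernel.sched_rt_runtime_us kernel.sched_rt_period_us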
On 20/02/19 17:37 +0000, Edwin Török wrote:
> strace for the situation described below (corosync 95%, 1 vCPU):
> https://clbin.com/hZL5z
I might have missed that earlier or this may be just some sort
of insignificant/misleading clue:
> strace: Process 4923 attached with 2 threads
> strace: [
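(For reference, a capture like the one quoted, attaching to the running
process and following both of its threads, would be produced along
these lines; PID 4923 is taken from the output above:)

  # Attach to corosync and trace both threads with timestamps
  strace -f -tt -p 4923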
On 19/02/19 16:41 +0000, Edwin Török wrote:
> Also noticed this:
> [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp
> 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000]
> [ 5390.361918] Code: b8 00 00 00 04 00 00 00 74 07 48 8d 05 f8 f2 0d 00
> c3 0f 1f 80 00 00 00 00 48
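(A side note, not from the thread: the faulting instruction can be
located from those numbers, since the kernel prints the instruction
pointer together with the base of the libc-2.17.so mapping. Assuming
the usual case where mapped address minus base equals the file offset:)

  # Offset of the faulting instruction inside libc-2.17.so
  printf '%x\n' $(( 0x7f221c5e03b1 - 0x7f221c554000 ))   # -> 8c3b1
  # Resolve it to a symbol/line (needs the matching glibc debuginfo)
  addr2line -f -e /lib64/libc-2.17.so 0x8c3b1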
On 18/02/2019 18:27, Edwin Török wrote:
> Did a test today with CentOS 7.6 with upstream kernel and with
> 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our
> patched [1] SBD) and was not able to reproduce the issue yet.
I was able to finally reproduce this using only upstream
On 15/02/19 08:48 +0100, Jan Friesse wrote:
> Ulrich Windl wrote:
>> IMHO any process running at real-time priorities must make sure
>> that it consumes the CPU only for short moments that are really
>> critical to be performed in time.
Pardon me, Ulrich, but something is off about this,
On 15/02/2019 16:08, Christine Caulfield wrote:
> On 15/02/2019 13:06, Edwin Török wrote:
>> I tried again with 'debug: trace'; lots of process pause here:
>> https://clbin.com/ZUHpd
>>
>> And here is an strace taken while running at realtime prio 99, with a
>> LOT of epoll_wait and sendmsg calls (gz format):
>>
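(To turn that kind of raw capture into a per-syscall breakdown,
strace's counting mode is the usual approach; a sketch, assuming
corosync is still the traced process:)

  # Count syscalls (epoll_wait, sendmsg, ...) across both threads;
  # interrupt with Ctrl-C to print the summary table
  strace -c -f -p $(pidof corosync)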
From: Jan Friesse
Sent: 14 February 2019 18:34
To: Cluster Labs - All topics related to open-source clustering welcomed; Edvin Torok
Cc: Mark Syms
Subject: Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?
Edwin,
> Hello,
> We were testing
On 14/02/2019 17:33, Edwin Török wrote:
Hello,
We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
noticed a fundamental problem with realtime priorities:
- corosync runs on CPU3, and interrupts for the NIC used by corosync are
also routed to CPU3
- corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
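(To check the colocation described above on a given node, something
along these lines works; the NIC name eth0 and the IRQ number <N> are
placeholders:)

  # Find the NIC's IRQ number(s) and the CPUs allowed to service them
  grep eth0 /proc/interrupts
  cat /proc/irq/<N>/smp_affinity     # hex CPU mask; 8 == CPU3 only
  # Where corosync may run, and its scheduling class/priority
  taskset -cp $(pidof corosync)
  chrt -p $(pidof corosync)
  # One mitigation: steer the NIC IRQ away from corosync's CPU
  echo 7 > /proc/irq/<N>/smp_affinity    # CPUs 0-2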