Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-26 Thread Edwin Török
On 26/02/2019 07:32, Jan Friesse wrote: > Edwin Török wrote: >>> Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6 (Xen 4.7.6) >> 2 vCPUs makes this a lot easier to reproduce the lost network connectivity/fencing.

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Edwin Török
kernel-4.4.52-4.0.12.x86_64.rpm (XenServer Lima), kernel-4.19.19-5.0.1.x86_64.rpm (XenServer master). The updated repro steps are: On 19/02/2019 16:26, Edwin Török wrote: > On 18/02/2019 18:27, Edwin Török wrote: > Setup: 16 CentOS 7.6 VMs, 4 vCPUs, 4GiB RAM running on XenServer 7.6 (Xen 4.7

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-25 Thread Edwin Török
On 20/02/2019 23:47, Jan Pokorný wrote: > On 20/02/19 21:16 +0100, Klaus Wenninger wrote: >> On 02/20/2019 08:51 PM, Jan Pokorný wrote: >>> On 20/02/19 17:37 +0000, Edwin Török wrote: >>>> strace for the situation described below (corosync 95%, 1

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 13:08, Jan Friesse wrote: > Edwin Török wrote: >> On 20/02/2019 07:57, Jan Friesse wrote: >>> Edwin, >>>> On 19/02/2019 17:02, Klaus Wenninger wrote: >>>>> On 02/19/2019 05:41 PM, Edwin Török wrote:

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 12:44, Jan Pokorný wrote: > On 19/02/19 16:41 +0000, Edwin Török wrote: >> Also noticed this: [ 5390.361861] crmd[12620]: segfault at 0 ip 7f221c5e03b1 sp 7ffcf9cf9d88 error 4 in libc-2.17.so[7f221c554000+1c2000] [ 5390.361918] Code: b8 00 00

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-20 Thread Edwin Török
On 20/02/2019 07:57, Jan Friesse wrote: > Edwin, >> On 19/02/2019 17:02, Klaus Wenninger wrote: >>> On 02/19/2019 05:41 PM, Edwin Török wrote: >>>> On 19/02/2019 16:26, Edwin Török wrote: >>>>> On 18/02/2019 18:27, Edwin Török wrote:

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 19/02/2019 17:02, Klaus Wenninger wrote: > On 02/19/2019 05:41 PM, Edwin Török wrote: >> On 19/02/2019 16:26, Edwin Török wrote: >>> On 18/02/2019 18:27, Edwin Török wrote: >>>> Did a test today with CentOS 7.6 with upstream kernel and with

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 19/02/2019 16:26, Edwin Török wrote: > On 18/02/2019 18:27, Edwin Török wrote: >> Did a test today with CentOS 7.6 with upstream kernel and with 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our patched [1] SBD) and was not able to reproduce

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-19 Thread Edwin Török
On 18/02/2019 18:27, Edwin Török wrote: > Did a test today with CentOS 7.6 with upstream kernel and with 4.20.10-1.el7.elrepo.x86_64 (tested both with upstream SBD, and our patched [1] SBD) and was not able to reproduce the issue yet. I was able to finally reproduce this using only

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-18 Thread Edwin Török
On 18/02/2019 15:49, Klaus Wenninger wrote: > On 02/18/2019 04:15 PM, Christine Caulfield wrote: >> On 15/02/2019 16:58, Edwin Török wrote: >>> On 15/02/2019 16:08, Christine Caulfield wrote: >>>> On 15/02/2019 13:06, Edwin Török wrote: >>>>> I tried

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 16:08, Christine Caulfield wrote: > On 15/02/2019 13:06, Edwin Török wrote: >> I tried again with 'debug: trace', lots of process pause here: https://clbin.com/ZUHpd >> And here is an strace running realtime prio 99, a LOT of epoll_wait and
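The strace pattern above (an endless stream of epoll_wait calls from a process at realtime priority 99) is what a non-blocking event loop looks like when its descriptors never go quiet. The fragment below is only a minimal illustration of that failure mode, not libqb's or corosync's actual loop: an event that is never drained plus a zero timeout makes epoll_wait return immediately forever, burning 100% of one CPU.

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

/* Illustration only: an epoll loop whose descriptor is always readable
 * and which never sleeps.  strace of this shows nothing but epoll_wait
 * calls, and under SCHED_RR it monopolizes its CPU. */
int main(void)
{
    int epfd = epoll_create1(0);
    int efd  = eventfd(1, 0);              /* counter > 0: always readable */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };

    if (epfd < 0 || efd < 0 ||
        epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) != 0) {
        perror("epoll setup");
        return 1;
    }

    for (;;) {
        struct epoll_event out;
        /* timeout 0 and an event that is never read(): returns at once,
         * every time, forever */
        epoll_wait(epfd, &out, 1, 0);
    }
}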

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 11:12, Christine Caulfield wrote: > On 15/02/2019 10:56, Edwin Török wrote: >> On 15/02/2019 09:31, Christine Caulfield wrote: >>> On 14/02/2019 17:33, Edwin Török wrote: >>>> Hello, >>>> We were testing corosync 2.4.3

Re: [ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-15 Thread Edwin Török
On 15/02/2019 09:31, Christine Caulfield wrote: > On 14/02/2019 17:33, Edwin Török wrote: >> Hello, >> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a fundamental problem with realtime priorities: >> - corosync runs on CPU3

[ClusterLabs] corosync SCHED_RR stuck at 100% cpu usage with kernel 4.19, priority inversion/livelock?

2019-02-14 Thread Edwin Török
Hello, We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and noticed a fundamental problem with realtime priorities:
- corosync runs on CPU3, and interrupts for the NIC used by corosync are also routed to CPU3
- corosync runs with SCHED_RR, ksoftirqd does not (should it?), but
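Corosync asks the kernel for SCHED_RR via the standard scheduling API; the sketch below is a minimal illustration of that setup (not corosync's actual code) and of why it matters here: once a realtime task stops blocking, SCHED_OTHER threads such as ksoftirqd that share its CPU no longer get to run, so the very softirqs that would deliver corosync's network traffic are starved.

#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Minimal sketch: put the calling process under SCHED_RR at the highest
 * realtime priority.  If such a task then spins without blocking, any
 * SCHED_OTHER work pinned to the same CPU (ksoftirqd, for example)
 * never runs. */
int main(void)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = 99;                /* top SCHED_RR priority */

    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        perror("sched_setscheduler");      /* needs CAP_SYS_NICE / root */
        return 1;
    }

    for (;;)
        ;   /* pathological case: never yields, CPU sits at 100% */
}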

[ClusterLabs] dlm_controld does not recover from failed lockspace join

2019-01-08 Thread Edwin Török
Hello, We've seen an issue in production where DLM 4.0.7 gets "stuck" and is unable to join more lockspaces. Other nodes in the cluster were able to join new lockspaces, but not the one that node 1 was stuck on. GFS2 was unaffected (the "stuck" lockspace was for a userspace control daemon, but that's
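For background, a userspace daemon joins a DLM lockspace through libdlm, which hands the cluster-wide join to dlm_controld; that join is the step that never completed on the stuck node. The fragment below is a rough sketch of that call sequence, assuming libdlm's dlm_create_lockspace()/dlm_release_lockspace() interface and linking with -ldlm; the lockspace name is a placeholder, and this is not the code of the affected daemon.

#include <stdio.h>
#include <libdlm.h>

/* Sketch: join a lockspace via libdlm, then leave it.  The create call
 * blocks until dlm_controld has finished the cluster-wide join -- the
 * step that never completed on the stuck node.  "example_ls" is a
 * placeholder name. */
int main(void)
{
    dlm_lshandle_t ls = dlm_create_lockspace("example_ls", 0600);
    if (ls == NULL) {
        perror("dlm_create_lockspace");    /* needs root and a running dlm_controld */
        return 1;
    }

    /* ... acquire/convert/release locks in the lockspace here ... */

    dlm_release_lockspace("example_ls", ls, 1 /* force */);
    return 0;
}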

Re: [ClusterLabs] SLES11 SP4:SBD fencing problem with Xen (NMI not handled)?

2018-07-30 Thread Edwin Török
On 30/07/18 08:24, Ulrich Windl wrote: > Hi! > We have a strange problem on one cluster node running Xen PV VMs (SLES11 SP4): After updating the kernel and adding new SBD devices (to replace an old storage system), the system just seems to freeze. Hi, Which version of Xen are you using

Re: [ClusterLabs] Antw: Growing a cluster from 1 node without fencing

2017-09-05 Thread Edwin Török
[Sorry for the long delay in replying, I was on vacation] On 14/08/17 15:30, Klaus Wenninger wrote: > If you have a disk you could use as shared-disk for sbd you could achieve a quorum-disk-like behavior. (your package-versions look as if you are using RHEL-7.4) Thanks for the suggestion,
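The shared-disk variant suggested above amounts to pointing sbd at a small shared block device in /etc/sysconfig/sbd alongside the watchdog. The lines below are a hypothetical sketch only (the device path is a placeholder, and the values are illustrative rather than the configuration that was actually deployed):

# /etc/sysconfig/sbd -- sketch only, device path is a placeholder
SBD_DEVICE="/dev/disk/by-id/example-shared-sbd-disk"
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_DELAY_START=no
SBD_PACEMAKER=yes
SBD_STARTMODE=always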

Re: [ClusterLabs] Antw: Growing a cluster from 1 node without fencing

2017-08-14 Thread Edwin Török
On 14/08/17 13:46, Klaus Wenninger wrote: > What does your /etc/sysconfig/sbd look like? > With just that pcs-command you get some default config with watchdog-only support. It currently looks like this:
SBD_DELAY_START=no
SBD_OPTS="-n cluster1"
SBD_PACEMAKER=yes
SBD_STARTMODE=always

[ClusterLabs] Growing a cluster from 1 node without fencing

2017-08-14 Thread Edwin Török
Hi, When setting up a cluster with just 1 node with auto-tie-breaker and DLM, and incrementally adding more, I got some unexpected fencing if the 2nd node doesn't join the cluster soon enough. What I also found surprising is that if the cluster has ever seen 2 nodes, then turning off the
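For context, auto-tie-breaker is a corosync votequorum option; a quorum stanza roughly like the sketch below (illustrative, not the exact configuration used in this report) is what enables it, together with wait_for_all, the related knob that changes how quorum behaves once the cluster has seen all of its nodes:

quorum {
    provider: corosync_votequorum
    # tie-breaker: in an even split, the partition holding the lowest
    # node id (by default) stays quorate
    auto_tie_breaker: 1
    # the cluster only becomes quorate for the first time once every
    # configured node has been seen
    wait_for_all: 1
}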