On 15/02/2019 11:12, Christine Caulfield wrote:
> On 15/02/2019 10:56, Edwin Török wrote:
>> On 15/02/2019 09:31, Christine Caulfield wrote:
>>> On 14/02/2019 17:33, Edwin Török wrote:
>>>> Hello,
>>>>
>>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>>> noticed a fundamental problem with realtime priorities:
>>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync
>>>>   are also routed to CPU3
>>>> - corosync runs with SCHED_RR; ksoftirqd does not (should it?), but
>>>>   without it packets sent/received on that interface would not get
>>>>   processed
>>>> - corosync is in a busy loop using 100% CPU, never giving softirqs a
>>>>   chance to be processed (including TIMER and SCHED)
>>>>
>>>
>>> Can you tell me what distribution this is please?
>>
>> This is a not-yet-released development version of XenServer based on
>> CentOS 7.5/7.6.
>> The kernel is 4.19.19 + patches to make it work well with Xen
>> (previously we were using a 4.4.52 + Xen patches and backports kernel).
>>
>> The versions of packages are:
>> rpm -q libqb corosync dlm sbd kernel
>> libqb-1.0.1-6.el7.x86_64
>> corosync-2.4.3-13.xs+2.0.0.x86_64
>> dlm-4.0.7-1.el7.x86_64
>> sbd-1.3.1-7.xs+2.0.0.x86_64
>> kernel-4.19.19-5.0.0.x86_64
>>
>> Package versions with +xs in the version have XenServer-specific patches
>> applied; libqb comes straight from upstream CentOS here:
>> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
>>
>>> There are patches to libqb that should be applied to fix a similar
>>> problem in 1.0.1-6 - but that's a RHEL version and kernel 4.19 is not
>>> a RHEL 7 kernel, so I just need to be sure that those fixes are in
>>> your libqb before going any further.
>>
>> We have libqb 1.0.1-6 from CentOS; it looks like there is 1.0.1-7,
>> which includes an SHM crash fix. Is this the one you were referring to,
>> or is there an additional patch elsewhere?
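(As an aside, for anyone reproducing the setup above: the usual mitigation for this kind of starvation is to route the NIC interrupts away from the CPU the SCHED_RR process is pinned to. A sketch of how the affinity mask works - the IRQ number and 4-CPU layout are hypothetical, adjust for your system:)

```shell
# The IRQ affinity mask in /proc/irq/<N>/smp_affinity is a hex CPU bitmask
# (bit N set = CPU N may service the interrupt).
# Build a mask allowing CPUs 0-2 and excluding CPU3 (where corosync runs):
mask=$(printf '%x' $(( (1 << 0) | (1 << 1) | (1 << 2) )))
echo "$mask"    # 7

# Applying it to the NIC's IRQ (IRQ 24 is a made-up example; find the real
# one with: grep <ifname> /proc/interrupts):
#   echo "$mask" > /proc/irq/24/smp_affinity

# Checking corosync's scheduling policy and realtime priority:
#   chrt -p "$(pidof corosync)"
```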
>> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
>>
>
> Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
> wanted to check it wasn't someone else's package that didn't have that
> patch in. The SHM patch in -7 fixes a race at shutdown (often seen with
> sbd). That shouldn't be a problem because there is a workaround in -6
> anyway, and it's not fixing a spin, which is what we have here of course.
>
> Are there any messages in the system logs from either corosync or
> related subsystems?
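(For anyone following along: the trace-level logging used for the capture below is enabled in the logging section of corosync.conf - a minimal sketch, with all other directives left at their defaults:)

```
logging {
        to_syslog: yes
        debug: trace
}
```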
I tried again with 'debug: trace'; there are lots of process pauses here:
https://clbin.com/ZUHpd

And here is an strace taken while running at realtime priority 99 - a LOT
of epoll_wait and sendmsg (gz format): https://clbin.com/JINiV

It detects large numbers of members leaving, but I think this is because
the corosync on those hosts got similarly stuck:

Feb 15 12:51:07 localhost corosync[29278]: [TOTEM ] A new membership (10.62.161.158:3152) was formed. Members left: 2 14 3 9 5 11 4 12 8 13 7 1 10
Feb 15 12:51:07 localhost corosync[29278]: [TOTEM ] Failed to receive the leave message. failed: 2 14 3 9 5 11 4 12 8 13 7 1 10

Looking at another host where corosync is still stuck at 100% CPU, it says
(https://clbin.com/6UOn6):

Feb 15 13:01:56 localhost corosync[30153]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
Feb 15 13:01:58 localhost corosync[30153]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.

Hope this helps,
--Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org