On 15/02/2019 10:56, Edwin Török wrote:
> On 15/02/2019 09:31, Christine Caulfield wrote:
>> On 14/02/2019 17:33, Edwin Török wrote:
>>> Hello,
>>>
>>> We were testing corosync 2.4.3/libqb 1.0.1-6/sbd 1.3.1/gfs2 on 4.19 and
>>> noticed a fundamental problem with realtime priorities:
>>> - corosync runs on CPU3, and interrupts for the NIC used by corosync
>>>   are also routed to CPU3
>>> - corosync runs with SCHED_RR; ksoftirqd does not (should it?), but
>>>   without it packets sent/received on that interface would not get
>>>   processed
>>> - corosync is in a busy loop using 100% CPU, never giving softirqs a
>>>   chance to be processed (including TIMER and SCHED)
>>
>> Can you tell me what distribution this is, please?
>
> This is a not-yet-released development version of XenServer based on
> CentOS 7.5/7.6.
> The kernel is 4.19.19 + patches to make it work well with Xen
> (previously we were using a 4.4.52 kernel + Xen patches and backports).
>
> The package versions are:
> rpm -q libqb corosync dlm sbd kernel
> libqb-1.0.1-6.el7.x86_64
> corosync-2.4.3-13.xs+2.0.0.x86_64
> dlm-4.0.7-1.el7.x86_64
> sbd-1.3.1-7.xs+2.0.0.x86_64
> kernel-4.19.19-5.0.0.x86_64
>
> Packages with +xs in the version have XenServer-specific patches
> applied; libqb comes straight from upstream CentOS here:
> https://git.centos.org/tree/rpms!libqb.git/fe522aa5e0af26c0cff1170b6d766b5f248778d2
>
>> There are patches to
>> libqb that should be applied to fix a similar problem in 1.0.1-6 - but
>> that's a RHEL version and kernel 4.19 is not a RHEL 7 kernel, so I just
>> need to be sure that those fixes are in your libqb before going any
>> further.
>
> We have libqb 1.0.1-6 from CentOS; it looks like there is a 1.0.1-7
> which includes an SHM crash fix. Is this the one you were referring to,
> or is there an additional patch elsewhere?
> https://git.centos.org/commit/rpms!libqb.git/b5ede72cb0faf5b70ddd504822552fe97bfbbb5e
>
Thanks. libqb-1.0.1-6 does have the patch I was thinking of - I mainly
wanted to check it wasn't someone else's package that didn't have that
patch in. The SHM patch in -7 fixes a race at shutdown (often seen with
sbd). That shouldn't be a problem here, because there is a workaround
for it in -6 anyway - and it's not fixing a spin, which is what we have
here, of course.

Are there any messages in the system logs from either corosync or
related subsystems?

Chrissie

>> Without doubt this is a bug; in normal operation corosync is quite
>> light on CPU.
>
> Thanks for the help in advance,
> --Edwin
>
>>
>> Chrissie
>>
>>> This appears to be a priority-inversion problem: if corosync runs as
>>> realtime, then everything it needs (timers, ...) should be realtime
>>> as well; otherwise running as realtime guarantees we'll miss the
>>> watchdog deadline, instead of guaranteeing that we process the data
>>> before the deadline.
>>>
>>> Do you have any advice on what the expected realtime priorities
>>> would be for:
>>> - corosync
>>> - sbd
>>> - hard irqs
>>> - soft irqs
>>>
>>> Also, would it be possible for corosync to avoid hogging the CPU in
>>> libqb?
>>> (Our hypothesis is that if softirqs are not processed, then timers
>>> wouldn't work for processes on that CPU either.)
>>>
>>> Some more background and analysis:
>>>
>>> We noticed that cluster membership changes were very unstable
>>> (especially as you approach 16-node clusters of physical hosts),
>>> causing the whole cluster to fence when adding a single node. This
>>> was previously working fairly reliably with a 4.4-based kernel.
>>>
>>> I increased the SBD timeout to 3600 so I could investigate the
>>> problem, and noticed that corosync was using 100% of CPU3 [1]. I
>>> immediately lost SSH access on eth0 (corosync was using eth1);
>>> eth0's interrupts were also processed on CPU3 (and irqbalance
>>> didn't move them).
>>>
>>> IIUC, SCHED_RR tasks should not be able to take up 100% of a CPU:
>>> according to [3] they shouldn't be allowed to use more than 95%.
>>>
>>> Softirqs were not processed at all on CPU3 (see [2]: the numbers in
>>> the CPU3 column did not change, while the numbers in the other
>>> columns did). I tried decreasing corosync's priority to 1 using
>>> chrt, which didn't help. I then increased the priority of ksoftirqd
>>> to 50 using chrt, which immediately solved the corosync CPU usage
>>> problem.
>>>
>>> I tried a simple infinite-loop program with realtime priority, but
>>> it didn't reproduce the problem of interrupts getting stuck.
>>>
>>> Three problems here:
>>> * All softirqs were stuck (not being processed) on CPU3, including
>>>   TIMER and SCHED. corosync relies quite heavily on timers; would
>>>   the failure to process them cause the 100% CPU usage?
>>> * Is there a kernel bug introduced between 4.4 and 4.19 that causes
>>>   realtime tasks to no longer respect the 95% limit? That limit
>>>   would leave 5% of the time for IRQs, including NIC IRQs.
>>> * If corosync runs at a higher priority than the kernel softirq
>>>   thread processing NIC IRQs, how does it expect incoming packets
>>>   to be processed while it is hogging the CPU by receiving, polling
>>>   and sending packets?
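
For reference on the priorities question: corosync requests SCHED_RR
for itself at startup, so any tuning has to take that as the starting
point. A minimal sketch of that kind of setup is below (illustrative
only, not a copy of corosync's code - corosync picks its own priority
value, and as far as I remember it asks for the maximum):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param param;

        /* Ask for the highest SCHED_RR priority the kernel offers.
         * A thread spinning at this priority preempts all SCHED_OTHER
         * work on its CPU, which by default includes ksoftirqd.
         * Pin this to a spare CPU before trying it on a real box. */
        param.sched_priority = sched_get_priority_max(SCHED_RR);
        if (param.sched_priority < 0 ||
            sched_setscheduler(0, SCHED_RR, &param) != 0) {
            perror("sched_setscheduler(SCHED_RR)");
            return 1;
        }
        printf("running SCHED_RR at priority %d\n", param.sched_priority);
        for (;;)
            ;  /* spin, to see whether RT throttling ever kicks in */
    }

On the 95% point: [3] below shows sched_rt_runtime_us = 950000 and
sched_rt_period_us = 1000000, i.e. realtime tasks should be throttled
after 950ms of every 1s period, leaving 50ms per second for everything
else - so a realtime spinner pinned to one CPU should top out around
95%, not the ~100% seen in [1].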
>>>
>>> On another host which exhibited the same problem I ran strace (which
>>> also had the side effect of getting corosync unstuck from 100% CPU
>>> use after strace finished):
>>>       1 bind
>>>       5 close
>>>     688 epoll_wait
>>>       8 futex
>>>       1 getsockname
>>>       3 ioctl
>>>       1 open
>>>       3 recvfrom
>>>     190 recvmsg
>>>      87 sendmsg
>>>       9 sendto
>>>       4 socket
>>>       6 write
>>>
>>> On yet another host I ran gdb while corosync was stuck:
>>> Thread 2 (Thread 0x7f6fd0c9b700 (LWP 16245)):
>>> #0  0x00007f6fd34a0afb in do_futex_wait.constprop.1 ()
>>>     from /lib64/libpthread.so.0
>>> #1  0x00007f6fd34a0b8f in __new_sem_wait_slow.constprop.0 ()
>>>     from /lib64/libpthread.so.0
>>> #2  0x00007f6fd34a0c2b in sem_wait@@GLIBC_2.2.5 ()
>>>     from /lib64/libpthread.so.0
>>> #3  0x00007f6fd3b38991 in qb_logt_worker_thread () from /lib64/libqb.so.0
>>> #4  0x00007f6fd349ae25 in start_thread () from /lib64/libpthread.so.0
>>> #5  0x00007f6fd31c4bad in clone () from /lib64/libc.so.6
>>>
>>> Thread 1 (Thread 0x7f6fd43c7b80 (LWP 16242)):
>>> #0  0x00007f6fd31c5183 in epoll_wait () from /lib64/libc.so.6
>>> #1  0x00007f6fd3b3dea8 in poll_and_add_to_jobs () from /lib64/libqb.so.0
>>> #2  0x00007f6fd3b2ed93 in qb_loop_run () from /lib64/libqb.so.0
>>> #3  0x000055592d62ff78 in main ()
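
The backtrace of thread 1, together with the strace counts (688
epoll_wait calls against roughly 290 actual receives/sends), suggests
the main loop is waking from epoll_wait almost immediately rather than
blocking. To be clear, the sketch below is not libqb's code - it is
just a generic illustration of how an event loop of this shape can
spin: if the poll timeout is derived from a timer that is already
overdue, and whatever should make progress on that timer is itself
starved by the spinning realtime thread, the loop never blocks again.
The helper name here is hypothetical:

    #include <stdio.h>
    #include <sys/epoll.h>

    /* Hypothetical stand-in for "milliseconds until the next timer
     * expires"; returns 0 once a timer is overdue. */
    static int ms_until_next_timer(void)
    {
        return 0;
    }

    int main(void)
    {
        struct epoll_event events[8];
        int efd = epoll_create1(0);

        if (efd < 0) {
            perror("epoll_create1");
            return 1;
        }
        for (;;) {
            /* With an overdue timer the timeout is 0, so epoll_wait
             * returns immediately even when no fds are ready. If the
             * overdue work never completes (e.g. because this loop is
             * SCHED_RR and starves whatever it is waiting on), this
             * becomes a 100% CPU busy loop. */
            int n = epoll_wait(efd, events, 8, ms_until_next_timer());
            if (n < 0) {
                perror("epoll_wait");
                return 1;
            }
            /* ... dispatch the n ready events and expired timers ... */
        }
    }

That shape would also fit the observation that attaching strace (which
briefly stops the process, letting other work on that CPU run) was
enough to get corosync unstuck.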
>>>
>>> [1]
>>> top - 15:51:38 up 47 min,  2 users,  load average: 3.81, 1.70, 0.70
>>> Tasks: 208 total,   4 running, 130 sleeping,   0 stopped,   0 zombie
>>> %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu1  : 53.8 us, 46.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu4  :  3.8 us,  0.0 sy,  0.0 ni, 96.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu6  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> %Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
>>> KiB Mem :  4210184 total,  3374752 free,   387380 used,   448052 buff/cache
>>> KiB Swap:  1048572 total,  1048572 free,        0 used.  3660276 avail Mem
>>>
>>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>> 16217 root      rt   0  194832  92532  66128 R 104.0  2.2   2:54.00 corosync
>>> 24004 root      20   0  161984   4432   3696 R   4.0  0.1   0:00.02 top
>>>
>>> cat /proc/interrupts | grep eth
>>>  88:     2       0    0       0    0    0    0    0  xen-pirq  -msi-x  eth0
>>>  89:  2805       0    0  419621    0    0    0    0  xen-pirq  -msi-x  eth0-TxRx-0
>>>  90:     2       0    0       0    0    0    1    0  xen-pirq  -msi-x  eth1
>>>  91:   435  171086    0       0    0    0    0    0  xen-pirq  -msi-x  eth1-TxRx-0
>>>
>>> [2]
>>> watch cat /proc/softirqs
>>>                CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7
>>>       HI:         1        0        0        0        0        0        0        0
>>>    TIMER:    355339   786951   367148   266583   389591   357184   369577   374880
>>>   NET_TX:         8     1081        1      743        1        1      223        5
>>>   NET_RX:      5807   230818     2709   529755     3323     2598     2550     1878
>>>    BLOCK:      8166       53       42       13       10     6563       21     6894
>>> IRQ_POLL:         0        0        0        0        0        0        0        0
>>>  TASKLET:       103        6        9       85        0        0        0        0
>>>    SCHED:    240428   100842    83166    51213    77824    77270    72074    72261
>>>  HRTIMER:         0        0        0        0        0        0        0        0
>>>      RCU:    135491   245345   139513   130591   140280   135094   141007   136063
>>>
>>> [3]
>>> /proc/sys/kernel/sched_rt_runtime_us
>>> 950000
>>> /proc/sys/kernel/sched_rt_period_us
>>> 1000000
>>>
>>> Best regards,
>>> --Edwin

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org