Hi,

We are seeing hard lockup warnings caused by the CFS bandwidth control code. The test case below fails almost immediately on a reasonably large (144 thread) POWER9 guest with:
watchdog: CPU 80 Hard LOCKUP
watchdog: CPU 80 TB:1134131922788, last heartbeat TB:1133207948315 (1804ms ago)
Modules linked in:
CPU: 80 PID: 0 Comm: swapper/80 Tainted: G L 4.20.0-rc4-00156-g94f371cb7394-dirty #98
NIP: c00000000018f618 LR: c000000000185174 CTR: c00000000018f5f0
REGS: c00000000fbbbd70 TRAP: 0100 Tainted: G L (4.20.0-rc4-00156-g94f371cb7394-dirty)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28000222 XER: 00000000
CFAR: c000000000002440 IRQMASK: 1
GPR00: c000000000185174 c000003fef927610 c0000000010bd500 c000003fab1dbb80
GPR04: c000003ffe2d3000 c00000000018f5f0 c000003ffe2d3000 00000076e60d19fe
GPR08: c000003fab1dbb80 0000000000000178 c000003fa722f800 0000000000000001
GPR12: c00000000018f5f0 c00000000ffb3700 c000003fef927f90 0000000000000000
GPR16: 0000000000000000 c000000000f8d468 0000000000000050 c00000000004ace0
GPR20: c000003ffe743260 0000000000002a61 0000000000000001 0000000000000000
GPR24: 00000076e61c5aa0 000000003b9aca00 0000000000000000 c00000000017cdb0
GPR28: c000003fc2290000 c000003ffe2d3000 c00000000018f5f0 c000003fa74ca800
NIP [c00000000018f618] tg_unthrottle_up+0x28/0xc0
LR [c000000000185174] walk_tg_tree_from+0x94/0x120
Call Trace:
[c000003fef927610] [c000003fe3ad5000] 0xc000003fe3ad5000 (unreliable)
[c000003fef927690] [c00000000004b8ac] smp_muxed_ipi_message_pass+0x5c/0x70
[c000003fef9276e0] [c00000000019d828] unthrottle_cfs_rq+0xe8/0x300
[c000003fef927770] [c00000000019dc80] distribute_cfs_runtime+0x160/0x1d0
[c000003fef927820] [c00000000019e044] sched_cfs_period_timer+0x154/0x2f0
[c000003fef9278a0] [c0000000001f8fc0] __hrtimer_run_queues+0x180/0x430
[c000003fef927920] [c0000000001fa2a0] hrtimer_interrupt+0x110/0x300
[c000003fef9279d0] [c0000000000291d4] timer_interrupt+0x104/0x2e0
[c000003fef927a30] [c000000000009028] decrementer_common+0x108/0x110

Adding CPUs or adding empty cgroups makes the situation worse.

We haven't had a chance to dig deeper yet.

Note: The test case makes no attempt to clean up after itself and sometimes takes my guest down :) (a cleanup sketch follows the script below)

Thanks,
Anton

--

#!/bin/bash -e

echo 1 > /proc/sys/kernel/nmi_watchdog
echo 1 > /proc/sys/kernel/watchdog_thresh

mkdir -p /sys/fs/cgroup/cpu/base_cgroup
echo 1000 > /sys/fs/cgroup/cpu/base_cgroup/cpu.cfs_period_us
echo 1000000 > /sys/fs/cgroup/cpu/base_cgroup/cpu.cfs_quota_us

# Create some empty cgroups
for i in $(seq 1 1024)
do
	mkdir -p /sys/fs/cgroup/cpu/base_cgroup/$i
done

# Create some cgroups with a CPU soaker
for i in $(seq 1 144)
do
	(while :; do :; done) &
	PID=$!
	mkdir -p /sys/fs/cgroup/cpu/base_cgroup/$PID
	echo $PID > /sys/fs/cgroup/cpu/base_cgroup/$PID/cgroup.procs
done
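Since the reproducer leaves its busy-loop tasks and the cgroup hierarchy behind, here is a minimal cleanup sketch. It assumes the hierarchy was created exactly as above and that the guest is still responsive; restoring watchdog_thresh to 10 assumes the usual default, and the sleep is only there to let the killed tasks exit before the directories are removed. This is not part of the original report.

#!/bin/bash

# Cleanup sketch (assumes the hierarchy created by the reproducer above).

# Kill the CPU soakers; their PIDs are recorded in the per-PID cgroups.
for pid in $(cat /sys/fs/cgroup/cpu/base_cgroup/*/cgroup.procs 2>/dev/null)
do
	kill $pid 2>/dev/null
done

# Give the killed tasks a moment to exit; a cgroup directory can only be
# removed once it no longer contains tasks or children.
sleep 1

for i in /sys/fs/cgroup/cpu/base_cgroup/*/
do
	rmdir $i 2>/dev/null
done
rmdir /sys/fs/cgroup/cpu/base_cgroup

# Restore the watchdog threshold (10 is the usual default).
echo 10 > /proc/sys/kernel/watchdog_thresh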