On Wed, Sep 05, 2018 at 08:29:07PM +0200, Jiri Kosina wrote:
> (and no, my testing of the patch I sent on current tree didn't produce any
> hangs -- was there a reliable way to trigger it on 3.10?).

Only a very specific libvirt acceptance test found this after a while,
and it wasn't a customer: it was caught by QA. The reporter said they
weren't sure how to reproduce this issue either; it happened once in a
while. The backtrace was still enough to fix it for sure, and then it
never happened again.

It's not because of virt but probably because of selinux+audit. This
is precisely why I thought that once you enter LSM from the scheduler
atomic path the trouble starts, as each LSM implementation of those
calls may crash or not crash. Perhaps you didn't sandbox KVM inside
selinux by default?

This is the lockup the patch I posted fixed for 3.10.

[ 1838.621010] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 6
[ 1838.629070] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 3.10.0-327.62.4.el7.x86_64 #1
[ 1838.637610] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
[ 1838.645954] Call Trace:
[ 1838.648680] <NMI> [<ffffffff8163a05d>] dump_stack+0x19/0x1b
[ 1838.655113] [<ffffffff816338d0>] panic+0xd8/0x1e7
[ 1838.660460] [<ffffffff8111e960>] ? restart_watchdog_hrtimer+0x50/0x50
[ 1838.667742] [<ffffffff8111ea22>] watchdog_overflow_callback+0xc2/0xd0
[ 1838.675024] [<ffffffff81162211>] __perf_event_overflow+0xa1/0x250
[ 1838.681920] [<ffffffff81162ce4>] perf_event_overflow+0x14/0x20
[ 1838.688526] [<ffffffff810337c8>] intel_pmu_handle_irq+0x1e8/0x470
[ 1838.695423] [<ffffffff812f83cc>] ? ioremap_page_range+0x24c/0x330
[ 1838.702320] [<ffffffff811a9031>] ? unmap_kernel_range_noflush+0x11/0x20
[ 1838.709797] [<ffffffff813997f4>] ? ghes_copy_tofrom_phys+0x124/0x210
[ 1838.716984] [<ffffffff81399980>] ? ghes_read_estatus+0xa0/0x190
[ 1838.723687] [<ffffffff816444bb>] perf_event_nmi_handler+0x2b/0x50
[ 1838.730582] [<ffffffff81643c09>] nmi_handle.isra.0+0x69/0xb0
[ 1838.736992] [<ffffffff81643db9>] do_nmi+0x169/0x340
[ 1838.742532] [<ffffffff81642ff9>] end_repeat_nmi+0x1e/0x7e
[ 1838.748653] [<ffffffff81641bbd>] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.755742] [<ffffffff81641bbd>] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.762831] [<ffffffff81641bbd>] ? _raw_spin_lock_irqsave+0x3d/0x60
[ 1838.769917] <<EOE>> [<ffffffff816391e5>] avc_compute_av+0x126/0x1b5
[ 1838.777125] [<ffffffff810b842e>] ? walk_tg_tree_from+0xbe/0x110
[ 1838.783828] [<ffffffff8128b9c4>] avc_has_perm_noaudit+0xc4/0x110
[ 1838.790628] [<ffffffff8128f1fb>] cred_has_capability+0x6b/0x120
[ 1838.797331] [<ffffffff810db71c>] ? ktime_get+0x4c/0xd0
[ 1838.803160] [<ffffffff810e167b>] ? clockevents_program_event+0x6b/0xf0
[ 1838.810532] [<ffffffff8128f2de>] selinux_capable+0x2e/0x40
[ 1838.816748] [<ffffffff81288f65>] security_capable_noaudit+0x15/0x20
[ 1838.823829] [<ffffffff8108b975>] has_ns_capability_noaudit+0x15/0x20
[ 1838.831014] [<ffffffff8108bc55>] ptrace_has_cap+0x35/0x40
[ 1838.837126] [<ffffffff8108c717>] ___ptrace_may_access+0xa7/0x1e0
[ 1838.843925] [<ffffffff8163f0ae>] __schedule+0x26e/0xa00
[ 1838.849855] [<ffffffff81640949>] schedule_preempt_disabled+0x29/0x70
[ 1838.857041] [<ffffffff810d9324>] cpu_startup_entry+0x184/0x290
[ 1838.863637] [<ffffffff8104891a>] start_secondary+0x1da/0x250