** Description changed:

  Problem:
- During kexec/reboot on ARM64 Grace systems, a CSD lock timeout occurs when KFENCE's toggle_allocation_gate() calls kick_all_cpus_sync() while CPU#0 is stuck in
- nbcon_atomic_flush_pending() with IRQs disabled. This causes system hangs or significant delays during kexec.
+ During kexec/reboot on ARM64 Grace systems, a CSD lock timeout occurs when CPU#0 is stuck in nbcon_atomic_flush_pending() with IRQs disabled. The pl011 UART driver
+ performs an unbounded busy-wait for hardware synchronization while IRQs are disabled, blocking CPU#0 from responding to any CSD IPIs for extended periods (11+
+ seconds observed upstream).
- The root cause is twofold:
- 1. nbcon_atomic_flush_pending() holds IRQs disabled for the entire console flush (including pl011 UART busy-wait), blocking CPU#0 from responding to CSD IPIs
- 2. KFENCE's toggle_allocation_gate() continues firing during shutdown, sending IPIs to all CPUs via kick_all_cpus_sync()
- https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/
+ https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/
  Reproduction:
- Verified on a 176-CPU Grace system using a test module that simulates the nbcon_atomic_flush_pending() IRQ-off condition. With CONFIG_KFENCE_STATIC_KEYS=y and CONFIG_KFENCE_SAMPLE_INTERVAL=100:
+ Verified on a 176-CPU Grace system using a test module that simulates the nbcon_atomic_flush_pending() IRQ-off condition on CPU#0 during shutdown.
  Without fix:
  smp: csd: Detected non-responsive CSD lock (#1) on CPU#145, waiting 5000000036 ns for CPU#00 do_nothing+0x0/0x10(0x0).
  smp: csd: CSD lock (#1) unresponsive.
  Sending NMI from CPU 145 to CPUs 0:
- With all three fixes applied: clean kexec, no CSD lock.
+ With fix applied: clean kexec, no CSD lock.
  Fix:
+ Please backport:
+ - 9bd18e1262c0 printk/nbcon: Restore IRQ in atomic flush after each emitted record
- 1. ce2bba89566b mm/kfence: add reboot notifier to disable KFENCE on shutdown
-    - Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
- 2. 9bc9ccbf4c93 mm/kfence: fix potential deadlock in reboot notifier
-    - Fixes: ce2bba89566b ("mm/kfence: add reboot notifier to disable KFENCE on shutdown")
- 3. 9bd18e1262c0 printk/nbcon: Restore IRQ in atomic flush after each emitted record
-
- We can ignore the kfence commits as we don't enable that.
+ This patch restores IRQs between each record in nbcon atomic flush, reducing the IRQ-off window on CPU#0 so it can respond to IPIs. It has a minor conflict in
+ kernel/printk/nbcon.c (allow_unsafe_takeover parameter difference); resolution is straightforward.
+
+ Note: The upstream fix also includes two kfence patches (ce2bba89566b and 9bc9ccbf4c93) but those are not needed for our kernel since CONFIG_KFENCE_STATIC_KEYS is
+ disabled in our config
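
The effect of restoring IRQs after each record can be illustrated with a toy model. This is only a sketch, not kernel code: the function name, the record count and the per-record cost are hypothetical, and the 5 s threshold is simply taken from the wait reported in the log above. It also assumes each record's emit time is bounded, which the pl011 busy-wait does not guarantee.

```python
def longest_irq_off_window(record_costs_ns, restore_between_records):
    """Longest stretch (ns) a modelled CPU spends with IRQs disabled.

    record_costs_ns: hypothetical time to emit each console record.
    restore_between_records: True models re-enabling IRQs after each
    record; False models one IRQ-off region around the whole flush.
    """
    if not record_costs_ns:
        return 0
    if restore_between_records:
        # A pending IPI waits at most for the slowest single record.
        return max(record_costs_ns)
    # A pending IPI waits for the entire backlog to be flushed.
    return sum(record_costs_ns)


# Threshold assumed from the ~5 s wait in the quoted CSD lock message.
CSD_TIMEOUT_NS = 5_000_000_000

# Hypothetical backlog: 6000 records at 2 ms each (12 s in total).
records = [2_000_000] * 6000

before = longest_irq_off_window(records, restore_between_records=False)
after = longest_irq_off_window(records, restore_between_records=True)

print("whole-flush window exceeds timeout:", before > CSD_TIMEOUT_NS)
print("per-record window exceeds timeout:", after > CSD_TIMEOUT_NS)
```

In this model the worst-case IPI latency on CPU#0 drops from the sum of all record costs to the cost of the slowest single record, which is why only the printk/nbcon patch is requested here.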
-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146955

Title:
  CSD lock timeout during kexec/reboot when KFENCE is enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2146955/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
