** Description changed:

  Problem:
-   During kexec/reboot on ARM64 Grace systems, a CSD lock timeout occurs when
-   KFENCE's toggle_allocation_gate() calls kick_all_cpus_sync() while CPU#0 is
-   stuck in nbcon_atomic_flush_pending() with IRQs disabled. This causes system
-   hangs or significant delays during kexec.
+   During kexec/reboot on ARM64 Grace systems, a CSD lock timeout occurs when
+   CPU#0 is stuck in nbcon_atomic_flush_pending() with IRQs disabled. The pl011
+   UART driver performs an unbounded busy-wait for hardware synchronization
+   while IRQs are disabled, blocking CPU#0 from responding to any CSD IPIs for
+   extended periods (11+ seconds observed upstream).
  
-   The root cause is twofold:
-   1. nbcon_atomic_flush_pending() holds IRQs disabled for the entire console
-      flush (including pl011 UART busy-wait), blocking CPU#0 from responding to
-      CSD IPIs
-   2. KFENCE's toggle_allocation_gate() continues firing during shutdown,
-      sending IPIs to all CPUs via kick_all_cpus_sync()
-   https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/
+ https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/
  
    Reproduction:
-   Verified on a 176-CPU Grace system using a test module that simulates the
-   nbcon_atomic_flush_pending() IRQ-off condition. With
-   CONFIG_KFENCE_STATIC_KEYS=y and CONFIG_KFENCE_SAMPLE_INTERVAL=100:
+   Verified on a 176-CPU Grace system using a test module that simulates the
+   nbcon_atomic_flush_pending() IRQ-off condition on CPU#0 during shutdown.
  
    Without fix:
    smp: csd: Detected non-responsive CSD lock (#1) on CPU#145, waiting 5000000036 ns for CPU#00 do_nothing+0x0/0x10(0x0).
    smp:     csd: CSD lock (#1) unresponsive.
    Sending NMI from CPU 145 to CPUs 0:
  
-   With all three fixes applied: clean kexec, no CSD lock.
+   With fix applied: clean kexec, no CSD lock.
  
    Fix:
+   Please backport:
+   - 9bd18e1262c0 printk/nbcon: Restore IRQ in atomic flush after each emitted record
  
-   1. ce2bba89566b mm/kfence: add reboot notifier to disable KFENCE on shutdown
-     - Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
-   2. 9bc9ccbf4c93 mm/kfence: fix potential deadlock in reboot notifier
-     - Fixes: ce2bba89566b ("mm/kfence: add reboot notifier to disable KFENCE on shutdown")
-   3. 9bd18e1262c0 printk/nbcon: Restore IRQ in atomic flush after each emitted record
-   
- We can ignore the kfence commits as we don't enable that.
+   This patch restores IRQs after each emitted record in the nbcon atomic
+   flush, reducing the IRQ-off window on CPU#0 so it can respond to IPIs. It
+   has a minor conflict in kernel/printk/nbcon.c (allow_unsafe_takeover
+   parameter difference); resolution is straightforward.
+ 
+   Note: The upstream fix also includes two kfence patches (ce2bba89566b and
+   9bc9ccbf4c93), but those are not needed for our kernel since
+   CONFIG_KFENCE_STATIC_KEYS is disabled in our config.
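
As a rough illustration of why restoring IRQs after each emitted record
shortens the IRQ-off window, the following userspace C sketch models the two
strategies. It is not kernel code: the function names and the per-record cost
array are hypothetical, and the real nbcon flush path is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: cost_ns[i] is the time spent emitting record i
 * (including any UART busy-wait) while IRQs are disabled.
 */

/*
 * Model of the old behaviour: IRQs stay disabled across the whole flush,
 * so the IRQ-off window is the sum of all per-record costs.
 */
uint64_t irq_off_window_whole_flush(const uint64_t *cost_ns, size_t n)
{
	uint64_t total = 0;

	assert(cost_ns != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += cost_ns[i];
	return total;
}

/*
 * Model of the fixed behaviour: IRQs are restored after each record,
 * so the longest IRQ-off window is the most expensive single record.
 */
uint64_t irq_off_window_per_record(const uint64_t *cost_ns, size_t n)
{
	uint64_t worst = 0;

	assert(cost_ns != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (cost_ns[i] > worst)
			worst = cost_ns[i];
	return worst;
}
```

With three records costing 2 s, 4 s and 5 s, the whole-flush model gives an
11 s window, while the per-record model gives 5 s. In this model a single slow
record still keeps IRQs off for its full duration, so the window is reduced
rather than bounded.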

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146955

Title:
  CSD lock timeout during kexec/reboot when KFENCE is enabled

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2146955/+subscriptions


