** Summary changed:

- Using a 6.8 kernel modprobe nvidia hangs on Grace Hopper
+ Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

** Also affects: nvidia-graphics-drivers-535-server (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
       Status: New => Confirmed

** Changed in: nvidia-graphics-drivers-550-server (Ubuntu)
       Status: New => Confirmed

** Description changed:

  Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I
  load the nvidia driver.
+ 
+ $ sudo dmidecode -t 0
+ # dmidecode 3.5
+ Getting SMBIOS data from sysfs.
+ SMBIOS 3.6.0 present.
+ # SMBIOS implementations newer than version 3.5.0 are not
+ # fully supported by this version of dmidecode.
+ 
+ Handle 0x0001, DMI type 0, 26 bytes
+ BIOS Information
+       Vendor: NVIDIA
+       Version:         01.02.01
+       Release Date: 20240207
+       ROM Size: 64 MB
+       Characteristics:
+               PCI is supported
+               PNP is supported
+               BIOS is upgradeable
+               BIOS shadowing is allowed
+               Boot from CD is supported
+               Selectable boot is supported
+               Serial services are supported (int 14h)
+               ACPI is supported
+               Targeted content distribution is supported
+               UEFI is supported
+       Firmware Revision: 0.0
  
  [  382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  382.946075] rcu:     53-...0: (4 ticks this GP) 
idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
  [  382.955683] rcu:              hardirqs   softirqs   csw/system
  [  382.961378] rcu:      number:        0          0            0
  [  382.967071] rcu:     cputime:        0          0            0   ==> 
30026(ms)
  [  382.974189] rcu:     (detected by 52, t=60034 jiffies, g=24469, q=1199 
ncpus=72)
  [  392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  392.992769] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior
  
- 
  After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1
  
  KDUMP INFO
  WARNING: cpu 54: cannot find NT_PRSTATUS note
-       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
-     DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
-         CPUS: 72
-         DATE: Wed Apr 17 21:39:13 UTC 2024
-       UPTIME: 00:06:10
+       KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
+     DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
+         CPUS: 72
+         DATE: Wed Apr 17 21:39:13 UTC 2024
+       UPTIME: 00:06:10
  LOAD AVERAGE: 0.68, 0.63, 0.28
-        TASKS: 854
-     NODENAME: hinyari
-      RELEASE: 6.8.0-1005-nvidia-64k
-      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
-      MACHINE: aarch64  (unknown Mhz)
-       MEMORY: 479.7 GB
-        PANIC: "Kernel panic - not syncing: RCU Stall"
-          PID: 0
-      COMMAND: "swapper/21"
-         TASK: ffff000082026880  (1 of 72)  [THREAD_INFO: ffff000082026880]
-          CPU: 21
-        STATE: TASK_RUNNING (PANIC)
+        TASKS: 854
+     NODENAME: hinyari
+      RELEASE: 6.8.0-1005-nvidia-64k
+      VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
+      MACHINE: aarch64  (unknown Mhz)
+       MEMORY: 479.7 GB
+        PANIC: "Kernel panic - not syncing: RCU Stall"
+          PID: 0
+      COMMAND: "swapper/21"
+         TASK: ffff000082026880  (1 of 72)  [THREAD_INFO: ffff000082026880]
+          CPU: 21
+        STATE: TASK_RUNNING (PANIC)
  
  [  300.313144] nvidia: loading out-of-tree module taints kernel.
  [  300.313153] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
  [  300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
- [  300.316699] 
+ [  300.316699]
  [  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  360.331206] rcu:     54-...0: (24 ticks this GP) 
idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
  [  360.340903] rcu:              hardirqs   softirqs   csw/system
  [  360.346597] rcu:      number:        0          0            0
  [  360.352291] rcu:     cputime:        0          0            0   ==> 
30031(ms)
  [  360.359408] rcu:     (detected by 21, t=60038 jiffies, g=25009, q=1123 
ncpus=72)
  [  360.366704] Sending NMI from CPU 21 to CPUs 54:
  [  370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  370.377983] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior.
  [  370.387322] rcu: RCU grace-period kthread stack dump:
  [  370.392482] task:rcu_preempt     state:I stack:0     pid:17    tgid:17    
ppid:2      flags:0x00000008
  [  370.392488] Call trace:
  [  370.392489]  __switch_to+0xd0/0x118
  [  370.392499]  __schedule+0x2a8/0x7b0
  [  370.392501]  schedule+0x40/0x168
  [  370.392502]  schedule_timeout+0xac/0x1e0
  [  370.392505]  rcu_gp_fqs_loop+0x128/0x508
  [  370.392512]  rcu_gp_kthread+0x150/0x188
  [  370.392514]  kthread+0xf8/0x110
  [  370.392519]  ret_from_fork+0x10/0x20
  [  370.392524] rcu: Stack dump where RCU GP kthread last ran:
  [  370.398128] Sending NMI from CPU 21 to CPUs 31:
  [  370.398131] NMI backtrace for cpu 31
  [  370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.398139] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  370.398142] pc : cpuidle_enter_state+0xd8/0x790
  [  370.398150] lr : cpuidle_enter_state+0xcc/0x790
  [  370.398153] sp : ffff800081eefd70
  [  370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 
0000000000000000
  [  370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 
0000000000000000
  [  370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 
000000563d72ece0
  [  370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: 
ffff800081f00030
  [  370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000ac8c73b08db0
  [  370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [  370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : 
ffffa0a1424fd244
  [  370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 
0000000000000000
  [  370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  370.398181] Call trace:
  [  370.398183]  cpuidle_enter_state+0xd8/0x790
  [  370.398185]  cpuidle_enter+0x44/0x78
  [  370.398195]  cpuidle_idle_call+0x15c/0x210
  [  370.398202]  do_idle+0xb0/0x130
  [  370.398204]  cpu_startup_entry+0x40/0x50
  [  370.398206]  secondary_start_kernel+0xec/0x130
  [  370.398211]  __secondary_switched+0xc0/0xc8
  [  370.399132] Kernel panic - not syncing: RCU Stall
  [  370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.414876] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.421192] Call trace:
  [  370.423686]  dump_backtrace+0xa4/0x150
  [  370.427514]  show_stack+0x24/0x50
  [  370.430896]  dump_stack_lvl+0x78/0xf8
  [  370.434640]  dump_stack+0x1c/0x38
  [  370.438023]  panic+0x3a4/0x440
  [  370.441141]  print_other_cpu_stall+0x578/0x610
  [  370.445681]  check_cpu_stall+0x240/0x300
  [  370.449686]  rcu_pending+0x44/0x220
  [  370.453248]  rcu_sched_clock_irq+0x7c/0x2c8
  [  370.457519]  update_process_times+0x7c/0xf8
  [  370.461794]  tick_sched_handle+0x3c/0x98
  [  370.465803]  tick_nohz_highres_handler+0x5c/0xe8
  [  370.470520]  __hrtimer_run_queues+0x164/0x398
  [  370.474969]  hrtimer_interrupt+0xf4/0x278
  [  370.479063]  arch_timer_handler_phys+0x38/0x80
  [  370.483607]  handle_percpu_devid_irq+0x94/0x2b8
  [  370.488238]  generic_handle_domain_irq+0x38/0x70
  [  370.492954]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  370.498743]  gic_handle_irq+0x2c/0xa0
  [  370.502481]  call_on_irq_stack+0x3c/0x50
  [  370.506486]  do_interrupt_handler+0xb0/0xc8
  [  370.510759]  el1_interrupt+0x48/0xf0
  [  370.514409]  el1h_64_irq_handler+0x1c/0x40
  [  370.518592]  el1h_64_irq+0x7c/0x80
  [  370.522063]  cpuidle_enter_state+0xd8/0x790
  [  370.526336]  cpuidle_enter+0x44/0x78
  [  370.529986]  cpuidle_idle_call+0x15c/0x210
  [  370.534169]  do_idle+0xb0/0x130
  [  370.537375]  cpu_startup_entry+0x44/0x50
  [  370.541380]  secondary_start_kernel+0xec/0x130
  [  370.545919]  __secondary_switched+0xc0/0xc8
  [  370.550197] SMP: stopping secondary CPUs
  [  371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
  [  371.607097] Starting crashdump kernel...
  [  371.611103] ------------[ cut here ]------------
  [  371.615820] Some CPUs may be stale, kdump will be unreliable.
  [  371.621695] WARNING: CPU: 21 PID: 0 at 
arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
  [  371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc 
dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset 
arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif 
i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler 
nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x 
coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath 
efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core 
mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce 
i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas 
pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk 
aes_ce_cipher
  [  371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  371.730748] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  371.744180] pc : machine_kexec+0x48/0x1f0
  [  371.748275] lr : machine_kexec+0x48/0x1f0
  [  371.752369] sp : ffff8000802afa10
  [  371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 
000000000000003c
  [  371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: 
ffffa0a144268cb4
  [  371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: 
ffffa0a14481a000
  [  371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: 
ffff800080ba0088
  [  371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000463
  [  371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 
726e75206562206c
  [  371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 
0000000000000000
  [  371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 
0000000000000000
  [  371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  371.828696] Call trace:
  [  371.831189]  machine_kexec+0x48/0x1f0
  [  371.834928]  __crash_kexec+0x94/0x128
  [  371.838668]  panic+0x380/0x440
  [  371.841784]  print_other_cpu_stall+0x578/0x610
  [  371.846325]  check_cpu_stall+0x240/0x300
  [  371.850331]  rcu_pending+0x44/0x220
  [  371.853892]  rcu_sched_clock_irq+0x7c/0x2c8
  [  371.858163]  update_process_times+0x7c/0xf8
  [  371.862434]  tick_sched_handle+0x3c/0x98
  [  371.866440]  tick_nohz_highres_handler+0x5c/0xe8
  [  371.871156]  __hrtimer_run_queues+0x164/0x398
  [  371.875605]  hrtimer_interrupt+0xf4/0x278
  [  371.879700]  arch_timer_handler_phys+0x38/0x80
  [  371.884240]  handle_percpu_devid_irq+0x94/0x2b8
  [  371.888869]  generic_handle_domain_irq+0x38/0x70
  [  371.893585]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  371.899368]  gic_handle_irq+0x2c/0xa0
  [  371.903105]  call_on_irq_stack+0x3c/0x50
  [  371.907110]  do_interrupt_handler+0xb0/0xc8
  [  371.911382]  el1_interrupt+0x48/0xf0
  [  371.915032]  el1h_64_irq_handler+0x1c/0x40
  [  371.919215]  el1h_64_irq+0x7c/0x80
  [  371.922686]  cpuidle_enter_state+0xd8/0x790
  [  371.926958]  cpuidle_enter+0x44/0x78
  [  371.930609]  cpuidle_idle_call+0x15c/0x210
  [  371.934793]  do_idle+0xb0/0x130
  [  371.937998]  cpu_startup_entry+0x44/0x50
  [  371.942003]  secondary_start_kernel+0xec/0x130
  [  371.946542]  __secondary_switched+0xc0/0xc8
  [  371.950815] ---[ end trace 0000000000000000 ]---
  
- 
  In an attempt to get more debug info, I tried the open driver on GitHub,
  using https://github.com/NVIDIA/open-gpu-kernel-modules:
  Version 550.76 - loads successfully
  Version 550.67 - loads successfully
- Version 550.54.15 - crashes - which is the same version as the 550 package 
that hangs.  Below is the crash info.  What is interesting is that in an 
attempt to capture more debug into I changed optimization in utils.mk from -O2 
to -O0 and the crash went away.  It also doesn't happen with -O1.  
+ Version 550.54.15 - crashes. This is the same version as the 550 package
that hangs. Below is the crash info. Interestingly, when I changed the
optimization level in utils.mk from -O2 to -O0 in an attempt to capture more
debug info, the crash went away. It also doesn't happen with -O1.
  
  CRASH INFO
  [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
- [ 8648.399560] 
+ [ 8648.399560]
  [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
  [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 
binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu 
arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset 
ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf 
ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq 
coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight 
dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs 
blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib 
ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce 
polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce 
sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 
xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra 
aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
- [ 8648.407608] 
+ [ 8648.407608]
  [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G        
   OE      6.8.0-1004-nvidia-64k #4
  [ 8648.511625] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [ 8648.525058] pc : __kmalloc+0x1e0/0x490
  [ 8648.528892] lr : 0xffffa00000000000
  [ 8648.532482] sp : ffff8000d132f5f0
  [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: 
ffffa00084d50484
  [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: 
ffff0000c2aba828
  [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: 
ffff8000d132f7c8
  [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: 
ffff8000d132f5e4
  [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000004
  [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : 
ffffa000806f73ec
  [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 
0000000000000000
  [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : 
ffff0000c2a98200
  [ 8648.608810] Call trace:
  [ 8648.611305]  __kmalloc+0x1e0/0x490
  [ 8648.614778]  0x8000604466e4a000
- [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf) 
+ [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
  [ 8648.624219] SMP: stopping secondary CPUs
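
As a reading aid for the CRASH INFO above, the sketch below splits the value printed after "Oops - FPAC:" into its ESR_ELx fields and checks whether the bracketed word in the "Code:" line is an A64 HINT instruction. The field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and the HINT encoding are taken from the Arm architecture; the function names and the small exception-class table are illustrative assumptions and are not part of the kernel or the NVIDIA driver. With the values from this log, the exception class comes out as 0x1C (pointer authentication failure) and the faulting word decodes as HINT #29, which is the AUTIASP encoding. That is consistent with the odd-looking lr value, but it does not by itself identify the root cause.

```python
# Minimal sketch: decode an arm64 ESR_ELx value and test for a HINT opcode.
# The table below is deliberately partial; unknown classes are reported as such.

EC_NAMES = {
    0x1C: "FPAC (pointer authentication failure)",
    0x21: "Instruction abort, same EL",
    0x25: "Data abort, same EL",
}


def decode_esr(esr: int) -> dict:
    """Split ESR_ELx into exception class, instruction length bit, and ISS."""
    ec = (esr >> 26) & 0x3F
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "unknown (not in this table)"),
        "il": (esr >> 25) & 0x1,
        "iss": esr & 0x1FFFFFF,
    }


def hint_number(insn: int):
    """Return the HINT immediate if insn is an A64 HINT instruction, else None."""
    # HINT #imm is encoded as 0xD503201F with imm in bits 11:5.
    if insn & 0xFFFFF01F == 0xD503201F:
        return (insn >> 5) & 0x7F
    return None


if __name__ == "__main__":
    # Values copied from the Oops above.
    print(decode_esr(0x72000000))
    print(hint_number(0xD50323BF))
```

The same helper can be pointed at the ESR value of any other arm64 Oops line; only the three exception classes listed in the table are named.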

** Description changed:

  Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I
  load the nvidia driver.
  
  $ sudo dmidecode -t 0
  # dmidecode 3.5
  Getting SMBIOS data from sysfs.
  SMBIOS 3.6.0 present.
  # SMBIOS implementations newer than version 3.5.0 are not
  # fully supported by this version of dmidecode.
  
  Handle 0x0001, DMI type 0, 26 bytes
  BIOS Information
-       Vendor: NVIDIA
-       Version:         01.02.01
-       Release Date: 20240207
-       ROM Size: 64 MB
-       Characteristics:
-               PCI is supported
-               PNP is supported
-               BIOS is upgradeable
-               BIOS shadowing is allowed
-               Boot from CD is supported
-               Selectable boot is supported
-               Serial services are supported (int 14h)
-               ACPI is supported
-               Targeted content distribution is supported
-               UEFI is supported
-       Firmware Revision: 0.0
- 
+  Vendor: NVIDIA
+  Version:         01.02.01
+  Release Date: 20240207
+  ROM Size: 64 MB
+  Characteristics:
+   PCI is supported
+   PNP is supported
+   BIOS is upgradeable
+   BIOS shadowing is allowed
+   Boot from CD is supported
+   Selectable boot is supported
+   Serial services are supported (int 14h)
+   ACPI is supported
+   Targeted content distribution is supported
+   UEFI is supported
+  Firmware Revision: 0.0
+ 
+ CONSOLE RCU STALL MESSAGE:
  [  382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  382.946075] rcu:     53-...0: (4 ticks this GP) 
idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
  [  382.955683] rcu:              hardirqs   softirqs   csw/system
  [  382.961378] rcu:      number:        0          0            0
  [  382.967071] rcu:     cputime:        0          0            0   ==> 
30026(ms)
  [  382.974189] rcu:     (detected by 52, t=60034 jiffies, g=24469, q=1199 
ncpus=72)
  [  392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  392.992769] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior
  
  After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1
  
- KDUMP INFO
+ KDUMP INFO:
  WARNING: cpu 54: cannot find NT_PRSTATUS note
        KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
      DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
          CPUS: 72
          DATE: Wed Apr 17 21:39:13 UTC 2024
        UPTIME: 00:06:10
  LOAD AVERAGE: 0.68, 0.63, 0.28
         TASKS: 854
      NODENAME: hinyari
       RELEASE: 6.8.0-1005-nvidia-64k
       VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
       MACHINE: aarch64  (unknown Mhz)
        MEMORY: 479.7 GB
         PANIC: "Kernel panic - not syncing: RCU Stall"
           PID: 0
       COMMAND: "swapper/21"
          TASK: ffff000082026880  (1 of 72)  [THREAD_INFO: ffff000082026880]
           CPU: 21
         STATE: TASK_RUNNING (PANIC)
  
  [  300.313144] nvidia: loading out-of-tree module taints kernel.
  [  300.313153] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
  [  300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
  [  300.316699]
  [  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  360.331206] rcu:     54-...0: (24 ticks this GP) 
idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
  [  360.340903] rcu:              hardirqs   softirqs   csw/system
  [  360.346597] rcu:      number:        0          0            0
  [  360.352291] rcu:     cputime:        0          0            0   ==> 
30031(ms)
  [  360.359408] rcu:     (detected by 21, t=60038 jiffies, g=25009, q=1123 
ncpus=72)
  [  360.366704] Sending NMI from CPU 21 to CPUs 54:
  [  370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  370.377983] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior.
  [  370.387322] rcu: RCU grace-period kthread stack dump:
  [  370.392482] task:rcu_preempt     state:I stack:0     pid:17    tgid:17    
ppid:2      flags:0x00000008
  [  370.392488] Call trace:
  [  370.392489]  __switch_to+0xd0/0x118
  [  370.392499]  __schedule+0x2a8/0x7b0
  [  370.392501]  schedule+0x40/0x168
  [  370.392502]  schedule_timeout+0xac/0x1e0
  [  370.392505]  rcu_gp_fqs_loop+0x128/0x508
  [  370.392512]  rcu_gp_kthread+0x150/0x188
  [  370.392514]  kthread+0xf8/0x110
  [  370.392519]  ret_from_fork+0x10/0x20
  [  370.392524] rcu: Stack dump where RCU GP kthread last ran:
  [  370.398128] Sending NMI from CPU 21 to CPUs 31:
  [  370.398131] NMI backtrace for cpu 31
  [  370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.398139] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  370.398142] pc : cpuidle_enter_state+0xd8/0x790
  [  370.398150] lr : cpuidle_enter_state+0xcc/0x790
  [  370.398153] sp : ffff800081eefd70
  [  370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 
0000000000000000
  [  370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 
0000000000000000
  [  370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 
000000563d72ece0
  [  370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: 
ffff800081f00030
  [  370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000ac8c73b08db0
  [  370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [  370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : 
ffffa0a1424fd244
  [  370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 
0000000000000000
  [  370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  370.398181] Call trace:
  [  370.398183]  cpuidle_enter_state+0xd8/0x790
  [  370.398185]  cpuidle_enter+0x44/0x78
  [  370.398195]  cpuidle_idle_call+0x15c/0x210
  [  370.398202]  do_idle+0xb0/0x130
  [  370.398204]  cpu_startup_entry+0x40/0x50
  [  370.398206]  secondary_start_kernel+0xec/0x130
  [  370.398211]  __secondary_switched+0xc0/0xc8
  [  370.399132] Kernel panic - not syncing: RCU Stall
  [  370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.414876] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.421192] Call trace:
  [  370.423686]  dump_backtrace+0xa4/0x150
  [  370.427514]  show_stack+0x24/0x50
  [  370.430896]  dump_stack_lvl+0x78/0xf8
  [  370.434640]  dump_stack+0x1c/0x38
  [  370.438023]  panic+0x3a4/0x440
  [  370.441141]  print_other_cpu_stall+0x578/0x610
  [  370.445681]  check_cpu_stall+0x240/0x300
  [  370.449686]  rcu_pending+0x44/0x220
  [  370.453248]  rcu_sched_clock_irq+0x7c/0x2c8
  [  370.457519]  update_process_times+0x7c/0xf8
  [  370.461794]  tick_sched_handle+0x3c/0x98
  [  370.465803]  tick_nohz_highres_handler+0x5c/0xe8
  [  370.470520]  __hrtimer_run_queues+0x164/0x398
  [  370.474969]  hrtimer_interrupt+0xf4/0x278
  [  370.479063]  arch_timer_handler_phys+0x38/0x80
  [  370.483607]  handle_percpu_devid_irq+0x94/0x2b8
  [  370.488238]  generic_handle_domain_irq+0x38/0x70
  [  370.492954]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  370.498743]  gic_handle_irq+0x2c/0xa0
  [  370.502481]  call_on_irq_stack+0x3c/0x50
  [  370.506486]  do_interrupt_handler+0xb0/0xc8
  [  370.510759]  el1_interrupt+0x48/0xf0
  [  370.514409]  el1h_64_irq_handler+0x1c/0x40
  [  370.518592]  el1h_64_irq+0x7c/0x80
  [  370.522063]  cpuidle_enter_state+0xd8/0x790
  [  370.526336]  cpuidle_enter+0x44/0x78
  [  370.529986]  cpuidle_idle_call+0x15c/0x210
  [  370.534169]  do_idle+0xb0/0x130
  [  370.537375]  cpu_startup_entry+0x44/0x50
  [  370.541380]  secondary_start_kernel+0xec/0x130
  [  370.545919]  __secondary_switched+0xc0/0xc8
  [  370.550197] SMP: stopping secondary CPUs
  [  371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
  [  371.607097] Starting crashdump kernel...
  [  371.611103] ------------[ cut here ]------------
  [  371.615820] Some CPUs may be stale, kdump will be unreliable.
  [  371.621695] WARNING: CPU: 21 PID: 0 at 
arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
  [  371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc 
dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset 
arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif 
i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler 
nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x 
coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath 
efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core 
mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce 
i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas 
pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk 
aes_ce_cipher
  [  371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  371.730748] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  371.744180] pc : machine_kexec+0x48/0x1f0
  [  371.748275] lr : machine_kexec+0x48/0x1f0
  [  371.752369] sp : ffff8000802afa10
  [  371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 
000000000000003c
  [  371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: 
ffffa0a144268cb4
  [  371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: 
ffffa0a14481a000
  [  371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: 
ffff800080ba0088
  [  371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000463
  [  371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 
726e75206562206c
  [  371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 
0000000000000000
  [  371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 
0000000000000000
  [  371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  371.828696] Call trace:
  [  371.831189]  machine_kexec+0x48/0x1f0
  [  371.834928]  __crash_kexec+0x94/0x128
  [  371.838668]  panic+0x380/0x440
  [  371.841784]  print_other_cpu_stall+0x578/0x610
  [  371.846325]  check_cpu_stall+0x240/0x300
  [  371.850331]  rcu_pending+0x44/0x220
  [  371.853892]  rcu_sched_clock_irq+0x7c/0x2c8
  [  371.858163]  update_process_times+0x7c/0xf8
  [  371.862434]  tick_sched_handle+0x3c/0x98
  [  371.866440]  tick_nohz_highres_handler+0x5c/0xe8
  [  371.871156]  __hrtimer_run_queues+0x164/0x398
  [  371.875605]  hrtimer_interrupt+0xf4/0x278
  [  371.879700]  arch_timer_handler_phys+0x38/0x80
  [  371.884240]  handle_percpu_devid_irq+0x94/0x2b8
  [  371.888869]  generic_handle_domain_irq+0x38/0x70
  [  371.893585]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  371.899368]  gic_handle_irq+0x2c/0xa0
  [  371.903105]  call_on_irq_stack+0x3c/0x50
  [  371.907110]  do_interrupt_handler+0xb0/0xc8
  [  371.911382]  el1_interrupt+0x48/0xf0
  [  371.915032]  el1h_64_irq_handler+0x1c/0x40
  [  371.919215]  el1h_64_irq+0x7c/0x80
  [  371.922686]  cpuidle_enter_state+0xd8/0x790
  [  371.926958]  cpuidle_enter+0x44/0x78
  [  371.930609]  cpuidle_idle_call+0x15c/0x210
  [  371.934793]  do_idle+0xb0/0x130
  [  371.937998]  cpu_startup_entry+0x44/0x50
  [  371.942003]  secondary_start_kernel+0xec/0x130
  [  371.946542]  __secondary_switched+0xc0/0xc8
  [  371.950815] ---[ end trace 0000000000000000 ]---
  
  In an attempt to get more debug info, I tried the open driver on GitHub,
  using https://github.com/NVIDIA/open-gpu-kernel-modules:
  Version 550.76 - loads successfully
  Version 550.67 - loads successfully
  Version 550.54.15 - crashes. This is the same version as the 550 package
that hangs. Below is the crash info. Interestingly, when I changed the
optimization level in utils.mk from -O2 to -O0 in an attempt to capture more
debug info, the crash went away. It also doesn't happen with -O1.
  
  CRASH INFO
  [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
  [ 8648.399560]
  [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
  [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 
binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu 
arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset 
ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf 
ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq 
coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight 
dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs 
blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib 
ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce 
polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce 
sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 
xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra 
aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
  [ 8648.407608]
  [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G        
   OE      6.8.0-1004-nvidia-64k #4
  [ 8648.511625] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [ 8648.525058] pc : __kmalloc+0x1e0/0x490
  [ 8648.528892] lr : 0xffffa00000000000
  [ 8648.532482] sp : ffff8000d132f5f0
  [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: 
ffffa00084d50484
  [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: 
ffff0000c2aba828
  [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: 
ffff8000d132f7c8
  [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: 
ffff8000d132f5e4
  [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000004
  [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : 
ffffa000806f73ec
  [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 
0000000000000000
  [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : 
ffff0000c2a98200
  [ 8648.608810] Call trace:
  [ 8648.611305]  __kmalloc+0x1e0/0x490
  [ 8648.614778]  0x8000604466e4a000
  [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
  [ 8648.624219] SMP: stopping secondary CPUs

** Description changed:

  Using both -generic and -nvidia 6.8 kernels I'm seeing a hang when I
  load the nvidia driver.
  
  $ sudo dmidecode -t 0
  # dmidecode 3.5
  Getting SMBIOS data from sysfs.
  SMBIOS 3.6.0 present.
  # SMBIOS implementations newer than version 3.5.0 are not
  # fully supported by this version of dmidecode.
  
  Handle 0x0001, DMI type 0, 26 bytes
  BIOS Information
   Vendor: NVIDIA
   Version:         01.02.01
   Release Date: 20240207
   ROM Size: 64 MB
   Characteristics:
    PCI is supported
    PNP is supported
    BIOS is upgradeable
    BIOS shadowing is allowed
    Boot from CD is supported
    Selectable boot is supported
    Serial services are supported (int 14h)
    ACPI is supported
    Targeted content distribution is supported
    UEFI is supported
   Firmware Revision: 0.0
  
  CONSOLE RCU STALL MESSAGE:
  [  382.938326] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  382.946075] rcu:     53-...0: (4 ticks this GP) 
idle=1c2c/1/0x4000000000000000 softirq=4866/4868 fqs=14124
  [  382.955683] rcu:              hardirqs   softirqs   csw/system
  [  382.961378] rcu:      number:        0          0            0
  [  382.967071] rcu:     cputime:        0          0            0   ==> 
30026(ms)
  [  382.974189] rcu:     (detected by 52, t=60034 jiffies, g=24469, q=1199 
ncpus=72)
  [  392.982095] rcu: rcu_preempt kthread starved for 9994 jiffies! g24469 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  392.992769] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior
  
  After seeing this, I enabled kdump and set kernel.panic_on_rcu_stall = 1.
  
  KDUMP INFO:
  WARNING: cpu 54: cannot find NT_PRSTATUS note
        KERNEL: /usr/lib/debug/boot/vmlinux-6.8.0-1004-nvidia-64k  [TAINTED]
      DUMPFILE: /var/crash/202404172139/dump.202404172139  [PARTIAL DUMP]
          CPUS: 72
          DATE: Wed Apr 17 21:39:13 UTC 2024
        UPTIME: 00:06:10
  LOAD AVERAGE: 0.68, 0.63, 0.28
         TASKS: 854
      NODENAME: hinyari
       RELEASE: 6.8.0-1005-nvidia-64k
       VERSION: #5-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 17 11:26:46 UTC 2024
       MACHINE: aarch64  (unknown Mhz)
        MEMORY: 479.7 GB
         PANIC: "Kernel panic - not syncing: RCU Stall"
           PID: 0
       COMMAND: "swapper/21"
          TASK: ffff000082026880  (1 of 72)  [THREAD_INFO: ffff000082026880]
           CPU: 21
         STATE: TASK_RUNNING (PANIC)
  
  [  300.313144] nvidia: loading out-of-tree module taints kernel.
  [  300.313153] nvidia: module verification failed: signature and/or required 
key missing - tainting kernel
  [  300.316694] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
  [  300.316699]
  [  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  360.331206] rcu:     54-...0: (24 ticks this GP) 
idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148
  [  360.340903] rcu:              hardirqs   softirqs   csw/system
  [  360.346597] rcu:      number:        0          0            0
  [  360.352291] rcu:     cputime:        0          0            0   ==> 
30031(ms)
  [  360.359408] rcu:     (detected by 21, t=60038 jiffies, g=25009, q=1123 
ncpus=72)
  [  360.366704] Sending NMI from CPU 21 to CPUs 54:
  [  370.367310] rcu: rcu_preempt kthread starved for 9993 jiffies! g25009 f0x0 
RCU_GP_DOING_FQS(6) ->state=0x0 ->cpu=31
  [  370.377983] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, 
OOM is now expected behavior.
  [  370.387322] rcu: RCU grace-period kthread stack dump:
  [  370.392482] task:rcu_preempt     state:I stack:0     pid:17    tgid:17    
ppid:2      flags:0x00000008
  [  370.392488] Call trace:
  [  370.392489]  __switch_to+0xd0/0x118
  [  370.392499]  __schedule+0x2a8/0x7b0
  [  370.392501]  schedule+0x40/0x168
  [  370.392502]  schedule_timeout+0xac/0x1e0
  [  370.392505]  rcu_gp_fqs_loop+0x128/0x508
  [  370.392512]  rcu_gp_kthread+0x150/0x188
  [  370.392514]  kthread+0xf8/0x110
  [  370.392519]  ret_from_fork+0x10/0x20
  [  370.392524] rcu: Stack dump where RCU GP kthread last ran:
  [  370.398128] Sending NMI from CPU 21 to CPUs 31:
  [  370.398131] NMI backtrace for cpu 31
  [  370.398136] CPU: 31 PID: 0 Comm: swapper/31 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.398139] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.398140] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  370.398142] pc : cpuidle_enter_state+0xd8/0x790
  [  370.398150] lr : cpuidle_enter_state+0xcc/0x790
  [  370.398153] sp : ffff800081eefd70
  [  370.398154] x29: ffff800081eefd70 x28: 0000000000000000 x27: 
0000000000000000
  [  370.398157] x26: 0000000000000000 x25: 000000563d67e4e0 x24: 
0000000000000000
  [  370.398160] x23: ffffa0a1445699f8 x22: 0000000000000000 x21: 
000000563d72ece0
  [  370.398162] x20: ffffa0a144569a10 x19: ffff00008fa4a800 x18: 
ffff800081f00030
  [  370.398165] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000ac8c73b08db0
  [  370.398168] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [  370.398170] x11: 0000000000000000 x10: 2da0fbe3d5e8c649 x9 : 
ffffa0a1424fd244
  [  370.398173] x8 : ffff0000820559b8 x7 : 0000000000000000 x6 : 
0000000000000000
  [  370.398175] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  370.398178] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  370.398181] Call trace:
  [  370.398183]  cpuidle_enter_state+0xd8/0x790
  [  370.398185]  cpuidle_enter+0x44/0x78
  [  370.398195]  cpuidle_idle_call+0x15c/0x210
  [  370.398202]  do_idle+0xb0/0x130
  [  370.398204]  cpu_startup_entry+0x40/0x50
  [  370.398206]  secondary_start_kernel+0xec/0x130
  [  370.398211]  __secondary_switched+0xc0/0xc8
  [  370.399132] Kernel panic - not syncing: RCU Stall
  [  370.403938] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  370.414876] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  370.421192] Call trace:
  [  370.423686]  dump_backtrace+0xa4/0x150
  [  370.427514]  show_stack+0x24/0x50
  [  370.430896]  dump_stack_lvl+0x78/0xf8
  [  370.434640]  dump_stack+0x1c/0x38
  [  370.438023]  panic+0x3a4/0x440
  [  370.441141]  print_other_cpu_stall+0x578/0x610
  [  370.445681]  check_cpu_stall+0x240/0x300
  [  370.449686]  rcu_pending+0x44/0x220
  [  370.453248]  rcu_sched_clock_irq+0x7c/0x2c8
  [  370.457519]  update_process_times+0x7c/0xf8
  [  370.461794]  tick_sched_handle+0x3c/0x98
  [  370.465803]  tick_nohz_highres_handler+0x5c/0xe8
  [  370.470520]  __hrtimer_run_queues+0x164/0x398
  [  370.474969]  hrtimer_interrupt+0xf4/0x278
  [  370.479063]  arch_timer_handler_phys+0x38/0x80
  [  370.483607]  handle_percpu_devid_irq+0x94/0x2b8
  [  370.488238]  generic_handle_domain_irq+0x38/0x70
  [  370.492954]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  370.498743]  gic_handle_irq+0x2c/0xa0
  [  370.502481]  call_on_irq_stack+0x3c/0x50
  [  370.506486]  do_interrupt_handler+0xb0/0xc8
  [  370.510759]  el1_interrupt+0x48/0xf0
  [  370.514409]  el1h_64_irq_handler+0x1c/0x40
  [  370.518592]  el1h_64_irq+0x7c/0x80
  [  370.522063]  cpuidle_enter_state+0xd8/0x790
  [  370.526336]  cpuidle_enter+0x44/0x78
  [  370.529986]  cpuidle_idle_call+0x15c/0x210
  [  370.534169]  do_idle+0xb0/0x130
  [  370.537375]  cpu_startup_entry+0x44/0x50
  [  370.541380]  secondary_start_kernel+0xec/0x130
  [  370.545919]  __secondary_switched+0xc0/0xc8
  [  370.550197] SMP: stopping secondary CPUs
  [  371.601076] SMP: failed to stop secondary CPUs 0-20,22-71
  [  371.607097] Starting crashdump kernel...
  [  371.611103] ------------[ cut here ]------------
  [  371.615820] Some CPUs may be stale, kdump will be unreliable.
  [  371.621695] WARNING: CPU: 21 PID: 0 at 
arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
  [  371.631124] Modules linked in: nvidia(OE+) ecc qrtr cfg80211 binfmt_misc 
dax_hmem cxl_acpi cxl_core nvidia_cspmu acpi_ipmi ast cdc_ether cdc_subset 
arm_smmuv3_pmu arm_cspmu_module coresight_trbe usbnet arm_spe_pmu ipmi_ssif 
i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf ipmi_msghandler 
nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq coresight_etm4x 
coresight_tmc coresight_funnel acpi_power_meter coresight dm_multipath 
efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic 
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib ib_uverbs macsec ib_core 
mlx5_dpll crct10dif_ce mlx5_core polyval_ce polyval_generic ghash_ce sm4_ce_gcm 
sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce mlxfw sm3 nvme psample sha3_ce 
i2c_smbus sha2_ce nvme_core tls sha256_arm64 xhci_pci sha1_ce xhci_pci_renesas 
pci_hyperv_intf nvme_auth i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk 
aes_ce_cipher
  [  371.719810] CPU: 21 PID: 0 Comm: swapper/21 Kdump: loaded Tainted: G       
    OE      6.8.0-1005-nvidia-64k #5-Ubuntu
  [  371.730748] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [  371.737064] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [  371.744180] pc : machine_kexec+0x48/0x1f0
  [  371.748275] lr : machine_kexec+0x48/0x1f0
  [  371.752369] sp : ffff8000802afa10
  [  371.755751] x29: ffff8000802afa10 x28: 0000000000000463 x27: 
000000000000003c
  [  371.763047] x26: 00000000000000c0 x25: 0000000000000280 x24: 
ffffa0a144268cb4
  [  371.770341] x23: ffffa0a14439f540 x22: ffffa0a1447cf4c0 x21: 
ffffa0a14481a000
  [  371.777636] x20: ffff0000d987e000 x19: ffff0000d987e000 x18: 
ffff800080ba0088
  [  371.784930] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000463
  [  371.792225] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 
726e75206562206c
  [  371.799519] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 
0000000000000000
  [  371.806814] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 
0000000000000000
  [  371.814108] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [  371.821402] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 
0000000000000000
  [  371.828696] Call trace:
  [  371.831189]  machine_kexec+0x48/0x1f0
  [  371.834928]  __crash_kexec+0x94/0x128
  [  371.838668]  panic+0x380/0x440
  [  371.841784]  print_other_cpu_stall+0x578/0x610
  [  371.846325]  check_cpu_stall+0x240/0x300
  [  371.850331]  rcu_pending+0x44/0x220
  [  371.853892]  rcu_sched_clock_irq+0x7c/0x2c8
  [  371.858163]  update_process_times+0x7c/0xf8
  [  371.862434]  tick_sched_handle+0x3c/0x98
  [  371.866440]  tick_nohz_highres_handler+0x5c/0xe8
  [  371.871156]  __hrtimer_run_queues+0x164/0x398
  [  371.875605]  hrtimer_interrupt+0xf4/0x278
  [  371.879700]  arch_timer_handler_phys+0x38/0x80
  [  371.884240]  handle_percpu_devid_irq+0x94/0x2b8
  [  371.888869]  generic_handle_domain_irq+0x38/0x70
  [  371.893585]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
  [  371.899368]  gic_handle_irq+0x2c/0xa0
  [  371.903105]  call_on_irq_stack+0x3c/0x50
  [  371.907110]  do_interrupt_handler+0xb0/0xc8
  [  371.911382]  el1_interrupt+0x48/0xf0
  [  371.915032]  el1h_64_irq_handler+0x1c/0x40
  [  371.919215]  el1h_64_irq+0x7c/0x80
  [  371.922686]  cpuidle_enter_state+0xd8/0x790
  [  371.926958]  cpuidle_enter+0x44/0x78
  [  371.930609]  cpuidle_idle_call+0x15c/0x210
  [  371.934793]  do_idle+0xb0/0x130
  [  371.937998]  cpu_startup_entry+0x44/0x50
  [  371.942003]  secondary_start_kernel+0xec/0x130
  [  371.946542]  __secondary_switched+0xc0/0xc8
  [  371.950815] ---[ end trace 0000000000000000 ]---
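
The console output above was hard-wrapped in transit, so a single kernel
record can span two or more lines. The following is a minimal sketch, not
part of the original report, of how the wrapped lines could be rejoined and
the stalled and detecting CPUs extracted. It assumes every record starts
with a "[ seconds.micros]" timestamp; the helper names rejoin and
summarize_stalls are hypothetical.

```python
import re

# A record starts with a "[ seconds.micros]" timestamp. Anything else is
# treated as the wrapped tail of the previous record (an assumption about
# how the log was folded in transit).
RECORD_START = re.compile(r"^\s*\[\s*\d+\.\d+\]")
STALLED_CPU = re.compile(r"rcu:\s+(\d+)-\S+: \((\d+) ticks this GP\)")
DETECTED_BY = re.compile(r"\(detected by (\d+), t=(\d+) jiffies")


def rejoin(lines):
    """Merge hard-wrapped continuation lines back into whole records."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if RECORD_START.match(line) or not records:
            records.append(text)
        else:
            records[-1] += " " + text
    return records


def summarize_stalls(records):
    """Return [(stalled_cpu, ticks)] and [(detecting_cpu, jiffies)]."""
    stalled = [(int(m[1]), int(m[2]))
               for r in records if (m := STALLED_CPU.search(r))]
    detected = [(int(m[1]), int(m[2]))
                for r in records if (m := DETECTED_BY.search(r))]
    return stalled, detected


# Three records copied from the log above, wrapped as they appear there.
sample = [
    "  [  360.323454] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
    "  [  360.331206] rcu:     54-...0: (24 ticks this GP) ",
    "idle=742c/1/0x4000000000000000 softirq=4931/4933 fqs=13148",
    "  [  360.359408] rcu:     (detected by 21, t=60038 jiffies, g=25009, q=1123 ",
    "ncpus=72)",
]

stalled, detected = summarize_stalls(rejoin(sample))
print("stalled CPUs:", stalled)
print("detected by:", detected)
```

For the excerpt used here, this reports CPU 54 as the stalled CPU and CPU 21
as the one that detected the stall, matching the NMI line in the log.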
  
  In an attempt to get more debug info, I tried the open driver from GitHub,
  using https://github.com/NVIDIA/open-gpu-kernel-modules:
  Version 550.76 - loads successfully
  Version 550.67 - loads successfully
- Version 550.54.15 - crashes - which is the same version as the 550 package 
that hangs.  Below is the crash info.  What is interesting is that in an 
attempt to capture more debug into I changed optimization in utils.mk from -O2 
to -O0 and the crash went away.  It also doesn't happen with -O1.
+ Version 550.54.15 - crashes - which is the same version as the 550 package 
that hangs.  Below is the crash info.  What is interesting is that in an 
attempt to capture more debug info, I changed the optimization flag in utils.mk 
from -O2 to -O0 and the crash went away.  It also doesn't happen with -O1.
  
  CRASH INFO
  [ 8648.399518] nvidia-nvlink: Nvlink Core is being initialized, major device 
number 506
  [ 8648.399560]
  [ 8648.399718] Internal error: Oops - FPAC: 0000000072000000 [#1] SMP
  [ 8648.407556] Modules linked in: nvidia(OE+) ecdh_generic ecc qrtr cfg80211 
binfmt_misc dax_hmem cxl_acpi cxl_core nvidia_cspmu arm_smmuv3_pmu 
arm_cspmu_module coresight_trbe arm_spe_pmu acpi_ipmi ast cdc_ether cdc_subset 
ipmi_ssif usbnet i2c_algo_bit uio_pdrv_genirq uio spi_nor ipmi_devintf 
ipmi_msghandler nls_iso8859_1 stm_p_basic coresight_stm stm_core cppc_cpufreq 
coresight_etm4x coresight_tmc coresight_funnel acpi_power_meter coresight 
dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs 
blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 mlx5_ib 
ib_uverbs macsec ib_core mlx5_dpll mlx5_core crct10dif_ce polyval_ce 
polyval_generic ghash_ce sm4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce 
sm3 mlxfw i2c_smbus nvme psample sha3_ce sha2_ce nvme_core tls sha256_arm64 
xhci_pci sha1_ce xhci_pci_renesas pci_hyperv_intf nvme_auth i2c_tegra 
aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: nvidia(OE)]
  [ 8648.407608]
  [ 8648.501397] CPU: 5 PID: 48130 Comm: insmod Kdump: loaded Tainted: G        
   OE      6.8.0-1004-nvidia-64k #4
  [ 8648.511625] Hardware name:  /P3880, BIOS         01.02.01 20240207
  [ 8648.517941] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
  [ 8648.525058] pc : __kmalloc+0x1e0/0x490
  [ 8648.528892] lr : 0xffffa00000000000
  [ 8648.532482] sp : ffff8000d132f5f0
  [ 8648.535864] x29: ffff8000d132f5f0 x28: 0000000000000000 x27: 
ffffa00084d50484
  [ 8648.543159] x26: 00000000000001f8 x25: 0000000000aa1d70 x24: 
ffff0000c2aba828
  [ 8648.550454] x23: ffffa00085026380 x22: ffff80009d3e0020 x21: 
ffff8000d132f7c8
  [ 8648.557749] x20: 0000000000000038 x19: ffff8000d132f628 x18: 
ffff8000d132f5e4
  [ 8648.565043] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000004
  [ 8648.572337] x14: 0000000000000000 x13: 0000000000000000 x12: 
0000000000000000
  [ 8648.579632] x11: 0000000000000000 x10: ffff8000d132f670 x9 : 
ffffa000806f73ec
  [ 8648.586926] x8 : ffff0000c2a98240 x7 : 0000000000000000 x6 : 
0000000000000000
  [ 8648.594221] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
0000000000000000
  [ 8648.601516] x2 : 0000000000000000 x1 : ffff000100084480 x0 : 
ffff0000c2a98200
  [ 8648.608810] Call trace:
  [ 8648.611305]  __kmalloc+0x1e0/0x490
  [ 8648.614778]  0x8000604466e4a000
  [ 8648.617986] Code: a9435bf5 a94463f7 910183ff f85f8e5e (d50323bf)
  [ 8648.624219] SMP: stopping secondary CPUs
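
As a reading aid for the oops above, and not part of the original report,
here is a small sketch that decodes the two raw values it prints: the
syndrome 0000000072000000 from the "Oops - FPAC" line and the faulting
instruction word d50323bf from the "Code:" line. The field layout used
(EC in bits 31:26, IL in bit 25, ISS in bits 24:0) is the AArch64 ESR_ELx
layout; the helper names decode_esr and hint_number are hypothetical, and
the lookup tables list only the values relevant to this trace.

```python
# Exception classes and HINT immediates relevant to this trace only.
EC_NAMES = {0x1C: "FPAC (pointer authentication failure)"}
HINT_NAMES = {25: "PACIASP", 29: "AUTIASP"}


def decode_esr(esr):
    """Split an ESR_ELx value into (EC, IL, ISS)."""
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length flag, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    return ec, il, iss


def hint_number(insn):
    """Return the HINT immediate, or None if insn is not a HINT encoding."""
    if (insn & 0xFFFFF01F) != 0xD503201F:
        return None
    return (insn >> 5) & 0x7F    # CRm:op2, bits [11:5]


ec, il, iss = decode_esr(0x72000000)
print(f"EC={ec:#x} ({EC_NAMES.get(ec, 'unknown')}), IL={il}, ISS={iss:#x}")

hint = hint_number(0xD50323BF)
print(f"HINT #{hint} ({HINT_NAMES.get(hint, 'unknown')})")
```

Read this way, the fault is a pointer authentication failure raised by the
AUTIASP in the __kmalloc epilogue, which is consistent with the implausible
lr value (0xffffa00000000000) shown in the register dump.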

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2062380

Title:
  Using a 6.8 kernel 'modprobe nvidia' hangs on Quanta Grace Hopper

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-535-server/+bug/2062380/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
