** Changed in: fabric-manager-535 (Ubuntu)
Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)
** Changed in: linux (Ubuntu)
Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)
** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)
** Changed in: fabric-manager-535 (Ubuntu)
Status: New => Fix Released
** Changed in: linux (Ubuntu)
Status: New => Fix Released
** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
Status: New => Fix Released
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2052663
Title:
fabric-manager-535 setup fails during install on Grace/Hopper arm64
system running noble
Status in fabric-manager-535 package in Ubuntu:
Fix Released
Status in linux package in Ubuntu:
Fix Released
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
Fix Released
Bug description:
This error occurs on both the standard and largemem variants of the latest
Noble server build of Ubuntu:
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic-64k aarch64)
(ISO link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207.1/noble-live-server-arm64+largemem.iso)
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64)
(ISO link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207/noble-live-server-arm64.iso)
CPU/GPU: NVIDIA Grace/Hopper
lsb_release -rd:
No LSB modules are available.
Description: Ubuntu Noble Numbat (development branch)
Release: 24.04
Kernel versions affected:
GNU/Linux 6.6.0-14-generic-64k aarch64
GNU/Linux 6.6.0-14-generic aarch64
Package version: nvidia-fabricmanager-535 (535.154.05-0ubuntu1 arm64)
Expected behavior: The package installs and its post-install setup steps
complete without hanging
Actual behavior:
On our Grace/Hopper system running Noble, installing
nvidia-fabricmanager-535 froze at 60% on two separate attempts, and all SSH
sessions froze with it. I am also unable to SSH back into the system after
this happens. This is the last output I see from my installer shell:
+ apt install -y nvidia-fabricmanager-535
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
nvidia-fabricmanager-535
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 1795 kB of archives.
After this operation, 8679 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports noble/multiverse arm64 nvidia-fabricmanager-535 arm64 535.154.05-0ubuntu1 [1795 kB]
Fetched 1795 kB in 1s (2439 kB/s)
Selecting previously unselected package nvidia-fabricmanager-535.
(Reading database ... 103745 files and directories currently installed.)
Preparing to unpack .../nvidia-fabricmanager-535_535.154.05-0ubuntu1_arm64.deb ...
Unpacking nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
Setting up nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /lib/systemd/system/nvidia-fabricmanager.service.
Progress: [ 60%]
[#################################################################################.......................................................]
This does not appear to cause a panic or reboot: I can still interact with
the console, and the apt process still shows up in ps aux, although it does
not seem to make progress. However, I observe the following console output,
which I believe may be related:
[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
[ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-worker):33269]
[ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-worker):33269]
[ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-worker):33269]
[ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
[ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
[ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-worker):33269]
[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
[ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1630.568973] rcu: 17-...0: (1 GPs behind) idle=2444/1/0x4000000000000000 softirq=13696/13700 fqs=126842
[ 1630.578665] rcu: hardirqs softirqs csw/system
[ 1630.584381] rcu: number: 0 0 0
[ 1630.590109] rcu: cputime: 0 0 0 ==> 1110384(ms)
[ 1630.597458] rcu: (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)
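For anyone triaging similar console output, the following is a minimal sketch (not part of the original report) of how the soft lockup entries above could be pulled out of a saved console log, for example to confirm that the same CPU and task are stuck throughout. The function name parse_soft_lockups and the regular expression are illustrative assumptions based only on the line format shown above; the text is flattened first because mail clients may wrap the task name onto its own line.

```python
import re

# Assumed line format, taken from the console output quoted in this report:
# [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s*"
    r"\[(?P<task>[^\]:]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(text):
    """Return one dict per soft lockup message found in `text`.

    Hypothetical helper: all whitespace (including newlines) is collapsed
    to single spaces first, so entries wrapped across lines still match.
    """
    flat = " ".join(text.split())
    events = []
    for m in SOFT_LOCKUP.finditer(flat):
        events.append(
            {
                "timestamp": float(m.group("ts")),       # seconds since boot
                "cpu": int(m.group("cpu")),
                "stuck_seconds": int(m.group("secs")),
                "task": m.group("task"),
                "pid": int(m.group("pid")),
            }
        )
    return events
```

Grouping the returned entries by cpu and pid would show whether a single task accounts for every lockup, as appears to be the case for CPU#16 and PID 33269 in the output above.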