We run Ubuntu 22.04 in Kubernetes clusters in Azure, and many of the agent pools are GPU-enabled VMs.
We also apply security updates, and we noticed a problem last week. The update to the latest kernel, linux-image-5.15.0-1063-azure (5.15.0-1063.72), caused the GPU driver to fail. In fact, the upgrade could have detected this and not installed the kernel, but it ignored the failure and continued anyway. This left all of those nodes unable to run GPU workloads.

I could not find the right place to report this.

There are really two problems. The first is that the kernel and the GPU driver are not compatible with each other. The second is that even though the DKMS module build failed at that critical step, the upgrade went on to install the kernel.

You can see this in the output:

    Setting up linux-azure-headers-5.15.0-1063 (5.15.0-1063.72) ...
    Setting up linux-modules-5.15.0-1063-azure (5.15.0-1063.72) ...
    Setting up linux-headers-5.15.0-1063-azure (5.15.0-1063.72) ...
    /etc/kernel/header_postinst.d/dkms:
     * dkms: running auto installation service for kernel 5.15.0-1063-azure
    Kernel preparation unnecessary for this kernel. Skipping...
    Building module:
    cleaning build area...
    'make' -j4 NV_EXCLUDE_BUILD_MODULES='nvidia-drm ' KERNEL_UNAME=5.15.0-1063-azure modules..................(bad exit status: 2)
    Error! Bad return status for module build on kernel: 5.15.0-1063-azure (x86_64)
    Consult /var/lib/dkms/nvidia/535.54.03/build/make.log for more information.
       ...done.
    Setting up linux-azure-cloud-tools-5.15.0-1063 (5.15.0-1063.72) ...
    Setting up linux-image-5.15.0-1063-azure (5.15.0-1063.72) ...
    I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-1063-azure
    I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-1063-azure
    Setting up linux-azure-tools-5.15.0-1063 (5.15.0-1063.72) ...
    Setting up linux-cloud-tools-5.15.0-1063-azure (5.15.0-1063.72) ...
    Setting up linux-cloud-tools-azure-lts-22.04 (5.15.0.1063.61) ...
    Setting up linux-headers-azure-lts-22.04 (5.15.0.1063.61) ...
    Setting up linux-image-azure-lts-22.04 (5.15.0.1063.61) ...
    Setting up linux-tools-5.15.0-1063-azure (5.15.0-1063.72) ...
    Setting up linux-tools-azure-lts-22.04 (5.15.0.1063.61) ...
    Processing triggers for linux-image-5.15.0-1063-azure (5.15.0-1063.72) ...
    /etc/kernel/postinst.d/dkms:
     * dkms: running auto installation service for kernel 5.15.0-1063-azure
    Kernel preparation unnecessary for this kernel. Skipping...
    Building module:
    cleaning build area...
    'make' -j4 NV_EXCLUDE_BUILD_MODULES='nvidia-drm ' KERNEL_UNAME=5.15.0-1063-azure modules..................(bad exit status: 2)
    Error! Bad return status for module build on kernel: 5.15.0-1063-azure (x86_64)
    Consult /var/lib/dkms/nvidia/535.54.03/build/make.log for more information.
       ...done.
    /etc/kernel/postinst.d/initramfs-tools:
    update-initramfs: Generating /boot/initrd.img-5.15.0-1063-azure
    /etc/kernel/postinst.d/zz-update-grub:
    Sourcing file `/etc/default/grub'
    Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
    Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
    Sourcing file `/etc/default/grub.d/init-select.cfg'
    Generating grub configuration file ...

The reason the build failed (end of the /var/lib/dkms/nvidia/535.54.03/build/make.log file):

      LD [M]  /var/lib/dkms/nvidia/535.54.03/build/nvidia-peermem.o
      MODPOST /var/lib/dkms/nvidia/535.54.03/build/Module.symvers
    ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
    make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.54.03/build/Module.symvers] Error 1
    make[2]: *** Deleting file '/var/lib/dkms/nvidia/535.54.03/build/Module.symvers'
    make[1]: *** [Makefile:1830: modules] Error 2
    make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-1063-azure'
    make: *** [Makefile:82: modules] Error 2

__
𝕄𝕚𝕔𝕙𝕒𝕖𝕝 𝕊𝕚𝕟𝕫 – Architect – 緑 – Microsoft

-- 
Ubuntu-quality mailing list
Ubuntu-quality@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-quality
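P.S. Since the upgrade itself does not stop on the DKMS error, one possible stopgap is to scan the captured apt/dpkg output for the failure line before a node is rebooted into the new kernel. The following is only a minimal sketch under assumptions: the function name `find_dkms_failures` and both regular expressions are hypothetical, written to match the exact wording of the log text quoted above, and they may not match other DKMS versions.

```python
import re

# Assumption: the failure line has exactly the wording seen in the quoted log.
DKMS_FAILURE = re.compile(
    r"Error! Bad return status for module build on kernel: (\S+) \(([^)\s]+)\)"
)
# Assumption: DKMS follows the failure with a hint pointing at make.log.
MAKE_LOG = re.compile(r"Consult (\S+/make\.log) for more information")


def find_dkms_failures(log_text):
    """Return unique (kernel, arch, make_log_path) tuples for failed builds.

    make_log_path is None when no 'Consult ... make.log' hint follows
    the failure line.
    """
    failures = []
    for match in DKMS_FAILURE.finditer(log_text):
        # Look for the make.log hint that comes after this failure line.
        hint = MAKE_LOG.search(log_text, match.end())
        entry = (match.group(1), match.group(2), hint.group(1) if hint else None)
        # The same failure is reported once per kernel hook, so deduplicate.
        if entry not in failures:
            failures.append(entry)
    return failures
```

Run against the output above, this should report a single failure for kernel 5.15.0-1063-azure, even though the error is printed by both the header_postinst.d and postinst.d hooks. A non-empty result could then be used to hold the node back from rebooting.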