We run Ubuntu 22.04 in Azure in Kubernetes clusters, and many of the agent 
pools are GPU-enabled VMs.

We apply security updates regularly, and last week we noticed a problem.

The update to the latest kernel, linux-image-5.15.0-1063-azure 
(5.15.0-1063.72), caused the GPU driver to fail.

In fact, the package upgrade could have detected the failed driver build and 
left the kernel alone, but it ignored the failure and continued anyway.

This made all of those nodes unable to run GPU workloads.  I could not find the 
right place to report this.
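
As a stopgap on our side, a check like the following could run before a node 
is rebooted into a new kernel. This is only a minimal sketch: the function 
name has_dkms_module is made up for illustration, and it assumes DKMS installs 
built modules under /lib/modules/<kernel>/updates/dkms/, which should be 
verified on the actual image.

```python
from pathlib import Path


def has_dkms_module(kernel_release: str, module: str = "nvidia",
                    modules_root: str = "/lib/modules") -> bool:
    """Return True if a DKMS-built module file exists for the given kernel.

    Assumption: DKMS places built modules in
    <modules_root>/<kernel_release>/updates/dkms/. Compressed variants
    such as nvidia.ko.zst are accepted as well.
    """
    dkms_dir = Path(modules_root) / kernel_release / "updates" / "dkms"
    if not dkms_dir.is_dir():
        # Nothing was installed by DKMS for this kernel at all.
        return False
    # "nvidia.ko*" matches nvidia.ko and nvidia.ko.zst, but not nvidia-uvm.ko.
    return any(dkms_dir.glob(f"{module}.ko*"))
```

Calling has_dkms_module("5.15.0-1063-azure") before the reboot and keeping the 
node on the old kernel when it returns False would avoid losing GPU capacity, 
though it does not fix the underlying packaging behavior.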

There are really two problems. The first is that the GPU driver does not 
build against the new kernel. The second is that even though the build failed 
at that critical step, the upgrade continued and installed the kernel.

You can see this in the output:

Setting up linux-azure-headers-5.15.0-1063 (5.15.0-1063.72) ...
Setting up linux-modules-5.15.0-1063-azure (5.15.0-1063.72) ...
Setting up linux-headers-5.15.0-1063-azure (5.15.0-1063.72) ...
/etc/kernel/header_postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.15.0-1063-azure

Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area...
'make' -j4 NV_EXCLUDE_BUILD_MODULES='nvidia-drm ' 
KERNEL_UNAME=5.15.0-1063-azure modules..................(bad exit status: 2)
Error! Bad return status for module build on kernel: 5.15.0-1063-azure (x86_64)
Consult /var/lib/dkms/nvidia/535.54.03/build/make.log for more information.
   ...done.
Setting up linux-azure-cloud-tools-5.15.0-1063 (5.15.0-1063.72) ...
Setting up linux-image-5.15.0-1063-azure (5.15.0-1063.72) ...
I: /boot/vmlinuz is now a symlink to vmlinuz-5.15.0-1063-azure
I: /boot/initrd.img is now a symlink to initrd.img-5.15.0-1063-azure
Setting up linux-azure-tools-5.15.0-1063 (5.15.0-1063.72) ...
Setting up linux-cloud-tools-5.15.0-1063-azure (5.15.0-1063.72) ...
Setting up linux-cloud-tools-azure-lts-22.04 (5.15.0.1063.61) ...
Setting up linux-headers-azure-lts-22.04 (5.15.0.1063.61) ...
Setting up linux-image-azure-lts-22.04 (5.15.0.1063.61) ...
Setting up linux-tools-5.15.0-1063-azure (5.15.0-1063.72) ...
Setting up linux-tools-azure-lts-22.04 (5.15.0.1063.61) ...
Processing triggers for linux-image-5.15.0-1063-azure (5.15.0-1063.72) ...
/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 5.15.0-1063-azure

Kernel preparation unnecessary for this kernel. Skipping...

Building module:
cleaning build area...
'make' -j4 NV_EXCLUDE_BUILD_MODULES='nvidia-drm ' 
KERNEL_UNAME=5.15.0-1063-azure modules..................(bad exit status: 2)
Error! Bad return status for module build on kernel: 5.15.0-1063-azure (x86_64)
Consult /var/lib/dkms/nvidia/535.54.03/build/make.log for more information.
   ...done.
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.15.0-1063-azure
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
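
Since the upgrade itself does not stop on the failure, the captured output can 
be scanned for it afterwards. The following is a rough sketch, not part of any 
Ubuntu tooling: the function name failed_dkms_kernels is hypothetical, and the 
pattern is taken only from the "Error! Bad return status" lines shown above, 
so other DKMS versions may word the message differently.

```python
import re

# Pattern based on the error line in the output above; an assumption that
# may not hold for other DKMS versions.
DKMS_ERROR = re.compile(
    r"^Error! Bad return status for module build on kernel: (\S+) \((\S+)\)",
    re.MULTILINE,
)


def failed_dkms_kernels(output: str) -> list[str]:
    """Return kernel releases whose DKMS build failed, in order, no repeats."""
    seen: list[str] = []
    for match in DKMS_ERROR.finditer(output):
        kernel = match.group(1)
        # The same failure is reported by both the header and image hooks,
        # so report each kernel only once.
        if kernel not in seen:
            seen.append(kernel)
    return seen
```

Feeding it the saved upgrade output and treating a non-empty result as a 
reason to hold the node back from rebooting would at least surface the problem.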

The reason the build failed is shown at the end of the 
/var/lib/dkms/nvidia/535.54.03/build/make.log file:

  LD [M]  /var/lib/dkms/nvidia/535.54.03/build/nvidia-peermem.o
  MODPOST /var/lib/dkms/nvidia/535.54.03/build/Module.symvers
ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 
'rcu_read_unlock_strict'
make[2]: *** [scripts/Makefile.modpost:133: 
/var/lib/dkms/nvidia/535.54.03/build/Module.symvers] Error 1
make[2]: *** Deleting file '/var/lib/dkms/nvidia/535.54.03/build/Module.symvers'
make[1]: *** [Makefile:1830: modules] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-1063-azure'
make: *** [Makefile:82: modules] Error 2
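
To pick this kind of failure out of make.log automatically, something like the 
sketch below could be used. The function name gpl_only_violations is invented 
for illustration, and the pattern is based only on the modpost line shown 
above. It tolerates a line break before the symbol name, because the line is 
wrapped in this mail.

```python
import re

# Based on the modpost error shown above; \s+ before the quoted symbol
# allows for the wrapped line.
GPL_ONLY = re.compile(
    r"ERROR: modpost: GPL-incompatible module (\S+) "
    r"uses GPL-only symbol\s+'([^']+)'"
)


def gpl_only_violations(make_log: str) -> list[tuple[str, str]]:
    """Return (module, symbol) pairs for each GPL-only symbol error found."""
    return GPL_ONLY.findall(make_log)
```

Run against the excerpt above, this should report nvidia.ko together with 
rcu_read_unlock_strict.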



__

𝕄𝕚𝕔𝕙𝕒𝕖𝕝 𝕊𝕚𝕟𝕫 – Architect – 緑 – Microsoft
-- 
Ubuntu-quality mailing list
Ubuntu-quality@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-quality
