Public bug reported:

[ Impact ]

 * Microsoft Azure NV-series instances with NVIDIA GRID drivers experience
xserver crashes after the drivers are installed by following Microsoft's
official guide [1].

 * Root cause analysis showed that the crash occurs when xorg.conf
specifies a device with BusID "PCI:0@<domain_id>:0:0", where <domain_id>
is >= 32767, while the hyperv_drm kernel module is loaded.

 * Either removing the BusID specification or unloading the hyperv_drm
kernel module appears to avoid the crash.

 * The crash happens while the X server is enumerating PCI devices: it
dereferences a NULL pointer when it tries to access the PCI device info.

 * The reason why it only happens while the hyperv_drm kernel module is
loaded is that the hyperv_drm module does not expose PCI hardware
information since it's a virtual device.

 * The upstream patch [2] addresses the issue, and an xserver built with
the patch has been confirmed not to crash.

 * The Ubuntu Focal `xorg-server` package does not currently include the
patch [2] (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).

 [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
 [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
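
To check whether a given configuration falls into the range described
above, the domain can be pulled out of the BusID string. The following is
only a sketch: the function names busid_domain and busid_is_affected are
hypothetical, the input is assumed to have the "PCI:<bus>@<domain>:<device>:<function>"
form used in this report, a BusID without "@<domain>" is assumed to mean
domain 0, and the value is not validated as a number.

```shell
#!/bin/sh
# Hypothetical helper: print the PCI domain of an xorg.conf BusID value
# such as "PCI:0@32828:0:0".
busid_domain() {
    rest=${1#*@}                 # drop everything up to and including '@'
    if [ "$rest" = "$1" ]; then  # no '@' present: assume domain 0
        echo 0
        return
    fi
    echo "${rest%%:*}"           # keep the text before the next ':'
}

# Succeeds when the domain is in the range this report associates with
# the crash (>= 32767).
busid_is_affected() {
    [ "$(busid_domain "$1")" -ge 32767 ]
}

if busid_is_affected "PCI:0@32828:0:0"; then
    echo "BusID domain is >= 32767"
fi
```

The check only covers the BusID condition; per the analysis above, the
crash also requires the hyperv_drm kernel module to be loaded.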

[ Test Plan ]

Part (a) is quoted from Microsoft's official guide [1].

Part (a):

 * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
   - e.g. `NV36adms A10`
 * Install updates, required tooling, and the desktop environment:
   - sudo apt-get update
   - sudo apt-get upgrade -y
   - sudo apt-get dist-upgrade -y
   - sudo apt-get install build-essential ubuntu-desktop -y
   - sudo apt-get install linux-azure -y
 * Disable nouveau kernel driver:
   # Create a blacklist file /etc/modprobe.d/nouveau.conf with the following contents:
   blacklist nouveau
   blacklist lbm-nouveau 
 * Reboot the VM, re-connect, and then stop X server:
   - sudo reboot
   # wait for the reboot, reconnect, and continue:
   - sudo systemctl stop lightdm.service
 * Download and install the NVidia GRID driver:
   - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
   - chmod +x NVIDIA-Linux-x86_64-grid.run
   - sudo ./NVIDIA-Linux-x86_64-grid.run
   # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
 * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
   - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
 * Edit /etc/nvidia/gridd.conf
   - sudo nano /etc/nvidia/gridd.conf
   # Append the following lines:
   IgnoreSP=FALSE
   EnableUI=FALSE
   # Remove this line if present:
   FeatureType=0
   # And save.
 * Reboot the VM
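
For a non-interactive run of Part (a), the gridd.conf edit can be done
with a small filter instead of an editor. This is a sketch only: the
function name apply_gridd_edits is hypothetical, and it assumes the two
settings are not already present in the file (it does not deduplicate).

```shell
#!/bin/sh
# Hypothetical filter: read a gridd.conf on stdin and write the edited
# file to stdout.
apply_gridd_edits() {
    # Drop a "FeatureType=0" line if present; grep exits non-zero when it
    # prints nothing, which is not an error here.
    grep -v '^FeatureType=0$' || true
    # Append the two settings from the step above.
    printf '%s\n' 'IgnoreSP=FALSE' 'EnableUI=FALSE'
}

# Demonstration on sample input rather than the real file.
printf '%s\n' '# sample line' 'FeatureType=0' | apply_gridd_edits
```

On a real system the filter could be fed from the template, for example
`apply_gridd_edits < /etc/nvidia/gridd.conf.template | sudo tee /etc/nvidia/gridd.conf`,
which replaces the separate copy and edit steps.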

 Part (b):

  * Ensure that the hyperv_drm kernel module is loaded:
    - sudo modprobe hyperv_drm 
  * Replace /etc/X11/xorg.conf with the attached xorg.conf file
  * Try to start the `xserver`:
    - sudo startx
  * `xserver` should crash with output similar to the following:
  X.Org X Server 1.20.13
  X Protocol Version 11, Revision 0
  Build Operating System: linux Ubuntu
  Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
  Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
  Build Date: 07 February 2023  12:48:13PM
  xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
  Current version of pixman: 0.38.4
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.
  Markers: (--) probed, (**) from config file, (==) default setting,
    (++) from command line, (!!) notice, (II) informational,
    (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
  (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
  (==) Using config file: "/etc/X11/xorg.conf"
  (==) Using system config directory "/usr/share/X11/xorg.conf.d"
  (EE) 
  (EE) Backtrace:
  (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
  (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
  (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
  (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
  (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
  (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
  (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
  (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
  (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
  (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
  (EE) 
  (EE) Segmentation fault at address 0x124
  (EE) 
  Fatal server error:
  (EE) Caught signal 11 (Segmentation fault). Server aborting
  (EE) 
  (EE) 
  Please consult the The X.Org Foundation support 
     at http://wiki.x.org
   for help. 
  (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
  (EE) 
  (EE) Server terminated with error (1). Closing log file.
  ^Cxinit: giving up
  xinit: unable to connect to X server: Connection refused
  xinit: unexpected signal 2
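
Because the crash in Part (b) depends on hyperv_drm being loaded, it can
help to confirm the module state before running startx. A minimal sketch,
assuming a Linux system where /proc/modules lists one module per line with
the name in the first field; the function name module_listed is
hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if module "$1" appears as the first field
# of any line of /proc/modules-style input on stdin.
module_listed() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

if [ -r /proc/modules ] && module_listed hyperv_drm < /proc/modules; then
    echo "hyperv_drm is loaded"
else
    echo "hyperv_drm is not loaded"
fi
```

If the module is reported as not loaded, the reproduction is not expected
to crash, which matches workaround #1 in this report.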

[ Where problems could occur ]

 * The regression risk is low: the patch is well isolated and only adds a
NULL check that the surrounding code already assumed was in place.

[ Other Info ]

 * Workaround #1: Unload the hyperv_drm kernel module:
   - sudo modprobe -r hyperv_drm
 * Workaround #2: Comment out the BusID line in the "Device" section of /etc/X11/xorg.conf:
   Section "Device"
      Identifier     "Device0"
      Driver         "nvidia"
      VendorName     "NVIDIA Corporation"
      # BusID          "PCI:0@32828:0:0"
      Option         "HardDPMS" "false"
      Option         "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
   EndSection
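
Workaround #2 can also be applied with sed rather than by hand. The
following is a sketch under stated assumptions: the function name
comment_out_busid is hypothetical, and it comments out every active BusID
line in the input, not only the one in a particular "Device" section.

```shell
#!/bin/sh
# Hypothetical filter: read an xorg.conf on stdin and write it to stdout
# with each active BusID line commented out. Lines that are already
# commented do not match and are left unchanged.
comment_out_busid() {
    sed 's/^\([[:space:]]*\)\(BusID[[:space:]]\)/\1# \2/'
}

# Demonstration on a sample fragment rather than the real file.
printf '%s\n' 'Section "Device"' \
              '      BusID          "PCI:0@32828:0:0"' \
              'EndSection' | comment_out_busid
```

Writing the result to a temporary file and reviewing it before replacing
/etc/X11/xorg.conf keeps the original configuration recoverable.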

** Affects: xorg-server (Ubuntu)
     Importance: Undecided
         Status: New

** Affects: xorg-server (Ubuntu Focal)
     Importance: Undecided
     Assignee: Mustafa Kemal Gilor (mustafakemalgilor)
         Status: In Progress

** Attachment added: "xorg.conf"
   https://bugs.launchpad.net/bugs/2007746/+attachment/5648222/+files/xorg.conf

** Also affects: xorg-server (Ubuntu Kinetic)
   Importance: Undecided
       Status: New

** Also affects: xorg-server (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: xorg-server (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: xorg-server (Ubuntu Lunar)
   Importance: Undecided
       Status: New

** Also affects: xorg-server (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** No longer affects: xorg-server (Ubuntu Bionic)

** No longer affects: xorg-server (Ubuntu Jammy)

** No longer affects: xorg-server (Ubuntu Kinetic)

** No longer affects: xorg-server (Ubuntu Lunar)

** Changed in: xorg-server (Ubuntu Focal)
       Status: New => In Progress

** Changed in: xorg-server (Ubuntu Focal)
     Assignee: (unassigned) => Mustafa Kemal Gilor (mustafakemalgilor)

https://bugs.launchpad.net/bugs/2007746

Title:
  [SRU] xserver crashes when hyperv_drm kernel module is loaded on azure
  NV series instances w/ nvidia grid driver
