Hi,

This blog at https://lab.piszki.pl/cloudstack-kvm-and-running-vm-with-vgpu/ helped us a lot. Thanks to Piotr Pisz, who is also active in the CloudStack community. Below are the steps we followed to make this work. Note that the passed-through GPU does not show up in the GPU count of the zone or anywhere else in CloudStack. We have Intel processors, by the way; if yours are AMD, please make the appropriate changes (e.g. amd_iommu instead of intel_iommu in the GRUB parameters below).
1. Enable VT-d in the BIOS.

2. The latest kernel that ships with Ubuntu already provides the vfio drivers as built-in kernel modules, so there is no need to load them additionally.

3. We have two GeForce RTX 2080 Ti cards in the server, and they appear as follows in lspci:

    37:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
            Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
    37:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
    37:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
    37:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
    86:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
            Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
    86:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
    86:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
    86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]

4. Add the GRUB parameters in /etc/default/grub:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt modprobe.blacklist=xhci_hcd vfio-pci.ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7"

   You may also do this by placing the relevant options under /etc/modprobe.d/vfio.conf or similar (a sketch follows step 7 below).

5. Blacklist any additional drivers on the host that would take over the GPU, in /etc/modprobe.d/blacklist-nvidia.conf:

    $ cat /etc/modprobe.d/blacklist-nvidia.conf
    blacklist nouveau
    blacklist nvidia
    blacklist xhci_hcd

6. Run the following and reboot the server:

    # update-initramfs -u -k all
    # update-grub2

7. Verify that IOMMU is enabled and working (a small helper to also check the IOMMU groups of the GPU functions is sketched below):

    # dmesg -T | grep -iE "iommu|dmar"
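As mentioned in step 4, the vfio-pci IDs can also be supplied through a modprobe configuration file instead of the kernel command line. A minimal sketch of such a file, assuming the same device IDs as above (we went with the GRUB parameters, so treat this only as an illustration; the softdep lines make vfio-pci claim the devices before nouveau/snd_hda_intel):

    # /etc/modprobe.d/vfio.conf (sketch only)
    options vfio-pci ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7
    # load vfio-pci before the drivers that would otherwise grab the devices
    softdep nouveau pre: vfio-pci
    softdep snd_hda_intel pre: vfio-pci

If you go this route, run update-initramfs -u -k all (as in step 6) so the options also end up in the initramfs.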
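In addition to the dmesg check, it is worth confirming which IOMMU group each GPU function sits in and which driver is currently bound to it; this is exactly what step 8 below is about. A small sketch using the PCI addresses from our lspci output (adjust them for your environment):

    #!/bin/sh
    # Print the IOMMU group and the currently bound driver for each GPU function.
    for dev in 0000:37:00.0 0000:37:00.1 0000:37:00.2 0000:37:00.3 \
               0000:86:00.0 0000:86:00.1 0000:86:00.2 0000:86:00.3
    do
        group=$(basename "$(readlink "/sys/bus/pci/devices/$dev/iommu_group")")
        driver=$(basename "$(readlink "/sys/bus/pci/devices/$dev/driver")")
        echo "IOMMU group ${group:-?}  $dev  driver: ${driver:-none}"
    done

All devices in the IOMMU group of a passed-through function have to be bound to the vfio driver, which is the error message step 8 mentions.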
8. One important thing is to make sure all the devices are bound to the vfio-pci driver. In our case, the xhci_pci driver was taking over the "USB 3.1 Host Controller". When that happens, QEMU fails to spawn the VM, stating "make sure all devices in the IOMMU group are bound to the vfio driver". To tackle this, we wrote a script that unbinds the device from xhci_hcd and binds it to vfio-pci at boot:

    $ cat /etc/initramfs-tools/scripts/init-top/vfio.sh
    #!/bin/sh
    PREREQ=""
    prereqs()
    {
        echo "$PREREQ"
    }
    case $1 in
    prereqs)
        prereqs
        exit 0
        ;;
    esac

    for dev in 0000:37:00.2 0000:86:00.2
    do
        echo -n "$dev" > /sys/bus/pci/drivers/xhci_hcd/unbind
        echo -n "$dev" > /sys/bus/pci/drivers/vfio-pci/bind
    done

    exit 0

   Then run:

    # update-initramfs -u -k all

9. Once everything is good, go ahead and reboot the machine. Your devices should then look something like the output below, with vfio-pci shown as the "Kernel driver in use":

    37:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
            Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: nouveau
    37:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: snd_hda_intel
    37:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: xhci_pci
    37:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
            Kernel driver in use: vfio-pci
    86:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
            Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: nouveau
    86:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: snd_hda_intel
    86:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
            Kernel driver in use: vfio-pci
            Kernel modules: xhci_pci
    86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
            Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
            Kernel driver in use: vfio-pci

10. In CloudStack, create a VM as per the normal workflow, then stop it and insert the additional settings below. This can be done from the web UI as well as the API.

    extraconfig-1

    <devices>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio' />
        <source>
          <address domain='0x0000' bus='0x37' slot='0x00' function='0x0' />
        </source>
      </hostdev>
    </devices>

    extraconfig-2

    <devices>
      <hostdev mode='subsystem' type='pci' managed='yes'>
        <driver name='vfio' />
        <source>
          <address domain='0x0000' bus='0x86' slot='0x00' function='0x0' />
        </source>
      </hostdev>
    </devices>

   Replace the device addresses to match your environment. We used two extraconfig settings because we have two GPUs, although extraconfig-1 alone can accommodate both devices as well.

11. Once done, schedule the VM on the host. CloudStack will compose the domain XML with the additional configuration (from extraconfig-n), and the qemu process for that VM will have additional arguments such as:

    -device vfio-pci,host=0000:37:00.0,id=hostdev0,bus=pci.0,addr=0x6
    -device vfio-pci,host=0000:86:00.0,id=hostdev1,bus=pci.0,addr=0x7

   (A quick way to inspect the composed XML on the host is sketched after step 12.)

12. Once the VM is up, check with lspci or similar inside the guest, then install the NVIDIA driver and confirm with nvidia-smi that everything is working fine (a short in-guest example is sketched below as well).
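To double-check what CloudStack composed, you can also look at the libvirt domain XML directly on the KVM host. A quick sketch (i-2-123-VM is just a placeholder instance name; use virsh list to find the real one):

    # on the KVM host, after the VM has been started by CloudStack
    virsh list --all
    virsh dumpxml i-2-123-VM | grep -A 6 "<hostdev"

You should see the two <hostdev> entries carrying the 0x37 and 0x86 bus addresses from the extraconfig settings.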
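And inside the guest, a short example of the final check from step 12 (the driver package name is only an example; pick whatever fits your distro and GPU):

    # inside the guest VM
    lspci -nn | grep -i nvidia          # the passed-through functions should be visible
    sudo apt install nvidia-driver-525  # example NVIDIA driver package on Ubuntu
    nvidia-smi                          # should list the RTX 2080 Ti(s)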
Thanks

On Sun, Nov 13, 2022 at 3:25 PM Alex Mattioli <alex.matti...@shapeblue.com> wrote:

> Hi Jay,
> I'd love to hear more about how you implemented the GPU pass-through, and
> I think it could be quite useful for the community as well.
>
> Cheers
> Alex
>
> -----Original Message-----
> From: Jayanth Reddy <jayanthreddy5...@gmail.com>
> Sent: 13 November 2022 10:43
> To: users@cloudstack.apache.org
> Cc: Emil Karlsson <emi...@kth.se>
> Subject: Re: vGPU support in CloudStack on Ubuntu KVM
>
> Hi,
> AFAIK, vGPU and GPU are only supported on the Xen hypervisor as per
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/GPU+and+vGPU+support+for+CloudStack+Guest+VMs
> Not sure about vGPU, but we managed to do a full GPU passthrough to a
> VM running on an Ubuntu KVM host. Interested to discuss further?
>
> Thanks
>
> On Wed, Oct 12, 2022 at 6:40 PM Pierre Le Fevre <pierr...@kth.se> wrote:
>
> > Hi all,
> > I am currently trying to get vGPU to work in some of our VMs in
> > CloudStack to enable GPU acceleration in Jupyter Notebooks.
> > Our current setup is using CloudStack 4.17.1.0 on Ubuntu 20.04 with
> > KVM as a hypervisor.
> > It seems like there is some support for vGPU in CloudStack but I can't
> > find proper documentation about compatibility with newer GPUs and
> > other hypervisors.
> >
> > I've tried installing all the proper drivers on the host machine with
> > an NVIDIA A6000 and it shows up properly in nvidia-smi. From the docs
> > available it seems like it should show up in the UI under the host
> > after adding it to our cluster, but no GPU appears. The dashboard also
> > reports 0 GPUs in the zone.
> >
> > Is this a limitation of KVM, or have some of you gotten this setup to work?
> >
> > All the best
> > Pierre
> >