Hi,

That seems to confirm our suspicion. We'll try to get it working with XenServer, but it seems like 8.2 is required for the RTX A6000, so it will probably take a while with sales teams etc.
Our current config is quite similar: a couple of beefy VMs have Quadro RTX 5000 cards passed through for acceleration, but sharing is quite limited. It's based on the blog from Piotr Pisz as well as this one: https://mathiashueber.com/pci-passthrough-ubuntu-2004-virtual-machine/
I've attached our internal guide for how this is done in case someone is interested!
I'll update you guys if we can get it to work; it seems like there aren't a lot of docs on this other than the Confluence page. Unfortunate that it's locked behind vGPU and Citrix licenses!

Have a good CCC :)
Pierre

On Sun, Nov 13, 2022 at 11:41 AM Jayanth Reddy <jayanthreddy5...@gmail.com> wrote:

> Hi,
>
> This blog at https://lab.piszki.pl/cloudstack-kvm-and-running-vm-with-vgpu/
> helped us a lot. Thanks to Piotr Pisz, he's active in the CloudStack
> community as well.
> Below are the steps we followed to make this work. Note that the GPU doesn't
> show up in the GPU count for the zone or anywhere else in CloudStack.
> We've got Intel processors, by the way; if yours are AMD, please make the
> appropriate changes.
>
> 1. Enable VT-d in the BIOS.
>
> 2. Since the latest kernel that ships with Ubuntu includes the vfio drivers
> as built-in kernel modules, there is no need to load them separately.
>
> 3. We have two RTX 2080 Ti cards in the server, and they show up as follows:
>
> 37:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
> 37:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
> 37:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
> 37:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
> 86:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
> 86:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
> 86:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
> 86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
>
> 4. Add the kernel parameters in /etc/default/grub:
>
> GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt modprobe.blacklist=xhci_hcd vfio-pci.ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7"
>
> You may also do this by placing the relevant options under /etc/modprobe.d/vfio.conf or similar.
>
> 5. Blacklist any additional drivers on the host that take over the GPU, in /etc/modprobe.d/blacklist-nvidia.conf:
>
> $ cat /etc/modprobe.d/blacklist-nvidia.conf
> blacklist nouveau
> blacklist nvidia
> blacklist xhci_hcd
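>
> (For reference, the /etc/modprobe.d alternative mentioned in step 4 could
> look roughly like the sketch below. It reuses the device IDs from step 3 and
> assumes vfio-pci is available as a loadable module; the softdep lines are
> one common way of making sure vfio-pci claims the devices before the regular
> drivers do. Run update-initramfs afterwards, as in step 6.)
>
> $ cat /etc/modprobe.d/vfio.conf
> # bind these PCI IDs (GPU, audio, USB and UCSI functions) to vfio-pci
> options vfio-pci ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7
> # load vfio-pci before the drivers that would otherwise claim the devices
> softdep nouveau pre: vfio-pci
> softdep snd_hda_intel pre: vfio-pci
> softdep xhci_pci pre: vfio-pci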
> 6. Run # update-initramfs -u -k all and # update-grub2, then reboot the
> server.
>
> 7. Verify that IOMMU is enabled and working: # dmesg -T | grep -iE "iommu|dmar"
>
> 8. One important thing is to make sure all devices are bound to the vfio-pci
> driver. In our case, the xhci_pci driver was taking over the "USB 3.1 Host
> Controller". When that happens, QEMU fails to spawn, stating "make sure all
> devices in the IOMMU group are bound to the vfio driver". To tackle this, we
> wrote a script that unbinds the USB controller functions from the xHCI
> driver and binds them to vfio-pci at boot (make sure the script is
> executable):
>
> $ cat /etc/initramfs-tools/scripts/init-top/vfio.sh
> #!/bin/sh
> PREREQ=""
>
> prereqs()
> {
>     echo "$PREREQ"
> }
>
> case $1 in
> prereqs)
>     prereqs
>     exit 0
>     ;;
> esac
>
> for dev in 0000:37:00.2 0000:86:00.2
> do
>     echo -n "$dev" > /sys/bus/pci/drivers/xhci_hcd/unbind
>     echo -n "$dev" > /sys/bus/pci/drivers/vfio-pci/bind
> done
>
> exit 0
>
> Then run # update-initramfs -u -k all again.
>
> 9. Once everything is good, go ahead and reboot the machine. Your devices
> should then look something like the below; the vfio-pci kernel module should
> be the "Kernel driver in use":
>
> 37:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: nouveau
> 37:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: snd_hda_intel
> 37:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: xhci_pci
> 37:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
> 86:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: nouveau
> 86:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 High Definition Audio Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: snd_hda_intel
> 86:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
>         Kernel modules: xhci_pci
> 86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
>         Subsystem: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:12fa]
>         Kernel driver in use: vfio-pci
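>
> (To double-check steps 7 to 9, a small helper along these lines can be used
> to print every PCI device together with its IOMMU group and the driver
> currently bound to it; all four functions of each GPU should report
> vfio-pci. This is only a rough sketch, adapt as needed.)
>
> #!/bin/sh
> # list each PCI device with its IOMMU group and the driver bound to it
> for dev in /sys/kernel/iommu_groups/*/devices/*; do
>     group=${dev#/sys/kernel/iommu_groups/}
>     group=${group%%/*}
>     addr=$(basename "$dev")
>     if [ -e "$dev/driver" ]; then
>         drv=$(basename "$(readlink "$dev/driver")")
>     else
>         drv="(none)"
>     fi
>     printf 'group %-4s %s  driver=%s\n' "$group" "$addr" "$drv"
> done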
> 10. Go to CloudStack, create a VM as per the normal workflow, then stop it
> and insert the additional settings below. This can be done from the web UI
> as well as the API.
>
> extraconfig-1
> <devices>
>   <hostdev mode='subsystem' type='pci' managed='yes'>
>     <driver name='vfio' />
>     <source>
>       <address domain='0x0000' bus='0x37' slot='0x00' function='0x0' />
>     </source>
>   </hostdev>
> </devices>
>
> extraconfig-2
> <devices>
>   <hostdev mode='subsystem' type='pci' managed='yes'>
>     <driver name='vfio' />
>     <source>
>       <address domain='0x0000' bus='0x86' slot='0x00' function='0x0' />
>     </source>
>   </hostdev>
> </devices>
>
> Replace the device addresses as per your environment. We had to add two
> extraconfigs because we have two GPUs, although extraconfig-1 alone can
> accommodate both devices as well.
>
> 11. Once done, schedule the VM on the host. CloudStack will compose the
> domain XML with the additional configuration specified in extraconfig-n, and
> the qemu process for that VM will have additional arguments such as:
>
> -device vfio-pci,host=0000:37:00.0,id=hostdev0,bus=pci.0,addr=0x6 -device vfio-pci,host=0000:86:00.0,id=hostdev1,bus=pci.0,addr=0x7
>
> 12. Once the VM is up, check with lspci or similar, then install the NVIDIA
> driver and confirm everything is working with nvidia-smi.
>
> Thanks
>
> On Sun, Nov 13, 2022 at 3:25 PM Alex Mattioli <alex.matti...@shapeblue.com> wrote:
>
> > Hi Jay,
> > I'd love to hear more about how you implemented the GPU pass-through, and
> > I think it could be quite useful for the community as well.
> >
> > Cheers
> > Alex
> >
> > -----Original Message-----
> > From: Jayanth Reddy <jayanthreddy5...@gmail.com>
> > Sent: 13 November 2022 10:43
> > To: users@cloudstack.apache.org
> > Cc: Emil Karlsson <emi...@kth.se>
> > Subject: Re: vGPU support in CloudStack on Ubuntu KVM
> >
> > Hi,
> > AFAIK, vGPU and GPU are only supported on the Xen hypervisor as per
> > https://cwiki.apache.org/confluence/display/CLOUDSTACK/GPU+and+vGPU+support+for+CloudStack+Guest+VMs
> > Not sure about vGPU, but we managed to do a full GPU passthrough to a VM
> > running on an Ubuntu KVM host. Interested to discuss further?
> >
> > Thanks
> >
> > On Wed, Oct 12, 2022 at 6:40 PM Pierre Le Fevre <pierr...@kth.se> wrote:
> >
> > > Hi all,
> > > I am currently trying to get vGPU working in some of our VMs in
> > > CloudStack to enable GPU acceleration in Jupyter Notebooks.
> > > Our current setup is CloudStack 4.17.1.0 on Ubuntu 20.04 with KVM as
> > > the hypervisor.
> > > It seems like there is some support for vGPU in CloudStack, but I can't
> > > find proper documentation about compatibility with newer GPUs and
> > > other hypervisors.
> > >
> > > I've tried installing all the proper drivers on the host machine with
> > > an NVIDIA A6000, and it shows up properly in nvidia-smi. From the docs
> > > available it seems like it should show up in the UI under the host
> > > after adding it to our cluster, but no GPU appears. The dashboard also
> > > reports 0 GPUs in the zone.
> > >
> > > Is this a limitation of KVM, or have some of you gotten this setup to
> > > work?
> > >
> > > All the best
> > > Pierre
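>
> (A rough way to sanity-check steps 11 and 12 once the VM has been scheduled;
> "i-2-xx-VM" is just a placeholder for whatever instance name libvirt shows
> for the VM on the KVM host.)
>
> # on the KVM host: confirm the hostdev entries made it into the live domain XML
> virsh dumpxml i-2-xx-VM | grep -A6 '<hostdev'
> # inside the guest: the passed-through GPU functions should be visible
> lspci -nn | grep -i nvidia
> # after installing the NVIDIA driver inside the guest
> nvidia-smi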