andkerber commented on issue #12204:
URL: https://github.com/apache/cloudstack/issues/12204#issuecomment-3928934474
Just a quick update from my side: CloudStack integration with an NVIDIA H100
GPU in vGPU mode does work fine. I had failed to enable the vGPU profiles at
the OS level and assumed that there might be an issue with the integration. I
can't say much about MIG mode, which is the original topic of this issue
report. I hope this did not cause too much confusion.
For anyone stumbling across this post, I'd like to leave some hints on
enabling vGPU profiles at the OS level so that CloudStack can discover them
successfully.
# enable persistence mode
/usr/bin/nvidia-smi -pm 1
# disable mig mode
/usr/bin/nvidia-smi -mig 0
# create the vGPU devices (needed after every reboot)
/usr/lib/nvidia/sriov-manage -e 00000000:20:00.0
# display all profiles/devices
mdevctl types
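Before looking at the profiles it can help to confirm that sriov-manage really created the virtual functions. A minimal sketch, assuming the GPU exposes the standard PCI "sriov_numvfs" attribute in sysfs. SYSFS_ROOT is a hypothetical override that I added only so the function can be tried against a fake directory tree; it is not something NVIDIA or cloudstack define. Note that sysfs uses the short address form (0000:20:00.0), not the 8-digit domain form passed to sriov-manage above.

```shell
#!/bin/sh
# Sketch: print how many SR-IOV virtual functions a PCI device currently exposes.
# SYSFS_ROOT is a hypothetical knob for testing; on a real host it defaults to /sys.
count_vfs() {
    pf_addr=$1    # physical function address in sysfs form, e.g. 0000:20:00.0
    numvfs_file="${SYSFS_ROOT:-/sys}/bus/pci/devices/$pf_addr/sriov_numvfs"
    if [ -r "$numvfs_file" ]; then
        cat "$numvfs_file"
    else
        # Attribute missing: device not found or SR-IOV not supported.
        echo 0
    fi
}
```

Usage would be "count_vfs 0000:20:00.0". A result of 0 after running sriov-manage suggests the enable step did not take effect.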
Pick a profile that suits your needs, for example this one:
nvidia-1070
Available instances: 0
Device API: vfio-pci
Name: NVIDIA H100L-11C
Description: num_heads=1, frl_config=60, framebuffer=11264M,
max_resolution=4096x2400, max_instance=8
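To judge whether a profile fits, the description line can be turned into a total framebuffer figure. A rough sketch, assuming the description keeps the "framebuffer=<n>M" and "max_instance=<n>" fields shown above and that max_instance is the per-GPU maximum for this profile. The function name is my own.

```shell
#!/bin/sh
# Sketch: framebuffer (in MiB) used if every instance of a profile is created.
# Takes the profile description as a single string.
profile_total_mib() {
    desc=$1
    # Pull the numeric values out of the "key=value" fields.
    fb=$(printf '%s\n' "$desc" | sed -n 's/.*framebuffer=\([0-9][0-9]*\)M.*/\1/p')
    max=$(printf '%s\n' "$desc" | sed -n 's/.*max_instance=\([0-9][0-9]*\).*/\1/p')
    # Fail if either field is missing rather than printing a bogus number.
    [ -n "$fb" ] && [ -n "$max" ] || return 1
    echo $((fb * max))
}
```

For the nvidia-1070 description above this gives 11264 * 8 = 90112 MiB, i.e. 88 GiB across all 8 instances.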
Let's say you want CloudStack to use 4 vGPUs with the profile shown above.
Use "find" to get the path of the "create" file for 4 nvidia-1070 devices:
# find /sys | grep mdev_supported_types.nvidia-1070.*create | sed -e 's/:/\\:/g' | head -4
/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.2/mdev_supported_types/nvidia-1070/create
/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:01.3/mdev_supported_types/nvidia-1070/create
/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.0/mdev_supported_types/nvidia-1070/create
/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:02.6/mdev_supported_types/nvidia-1070/create
Now generate a UUID with uuidgen for each device and write it to the
corresponding "create" file listed above. This prints the commands without
running them:
# find /sys | grep mdev_supported_types.nvidia-1070.*create | sed -e 's/:/\\:/g' | head -4 | awk '{print "uuidgen >"$1}'
uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.2/mdev_supported_types/nvidia-1070/create
uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:01.3/mdev_supported_types/nvidia-1070/create
uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.0/mdev_supported_types/nvidia-1070/create
uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:02.6/mdev_supported_types/nvidia-1070/create
If the output looks right to you, execute the 4 commands.
After that, "mdevctl list" will show those 4 devices and CloudStack will be
able to discover them.
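The find/uuidgen steps can also be wrapped in one small function. This is only a sketch of the same idea, not a tested tool: it additionally skips parents whose "available_instances" file reports 0 (a standard attribute next to "create" in the mdev sysfs layout). SYSFS_ROOT and DRY_RUN are hypothetical knobs that I made up for trying it out safely.

```shell
#!/bin/sh
# Sketch: create COUNT mediated devices of one type, one per free parent.
# Prints "<uuid> <create file>" per device; fails if fewer than COUNT were made.
# SYSFS_ROOT (default /sys) and DRY_RUN (non-empty = do not write) are
# hypothetical knobs for testing.
create_mdevs() {
    mdev_type=$1    # e.g. nvidia-1070
    count=$2        # number of devices wanted
    created=0
    # sysfs paths contain no whitespace, so word splitting is safe here.
    for create_file in $(find "${SYSFS_ROOT:-/sys}/devices" \
            -path "*/mdev_supported_types/$mdev_type/create" 2>/dev/null | sort); do
        [ "$created" -ge "$count" ] && break
        type_dir=$(dirname "$create_file")
        avail=$(cat "$type_dir/available_instances" 2>/dev/null || echo 0)
        # Skip parents that cannot host another instance of this type.
        [ "$avail" -gt 0 ] || continue
        uuid=$(uuidgen)
        if [ -z "$DRY_RUN" ]; then
            echo "$uuid" > "$create_file"
        fi
        echo "$uuid $create_file"
        created=$((created + 1))
    done
    [ "$created" -eq "$count" ]
}
```

Running "DRY_RUN=1 create_mdevs nvidia-1070 4" first only shows which parents would be used. Without DRY_RUN it needs root, like the commands above.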
If you want these devices to survive a reboot, you can "define" them and then
set them to "auto" start like this:
mdevctl list | grep manual | awk '{print "mdevctl define --uuid "$1}' | sh
mdevctl list | grep manual | awk '{print "mdevctl modify --auto --uuid "$1}' | sh
In my case I created 8 devices, and the output of mdevctl list looks like
this:
# mdevctl list
0b27fab9-8e8d-4ad7-91bd-6d7ed0b4440e 0000:20:00.7 nvidia-1070 auto (defined)
482b292d-6b15-4370-a3a0-7fd96d8a0cc5 0000:20:01.1 nvidia-1070 auto (defined)
b1efcc41-50f0-461f-a8cb-34ddb69f3820 0000:20:01.3 nvidia-1070 auto (defined)
7946f615-17c5-4035-b401-73923f7f42e5 0000:20:02.4 nvidia-1070 auto (defined)
ea734610-9edb-4db4-a299-3cfc04acd4e8 0000:20:02.6 nvidia-1070 auto (defined)
b2bfcfef-e4e7-460f-9b0a-8f8d61df00b9 0000:20:03.0 nvidia-1070 auto (defined)
d644316e-f2bd-4cb1-96ba-131404833e38 0000:20:03.2 nvidia-1070 auto (defined)
e1be4558-1502-4703-a8dd-260576e0f224 0000:20:04.1 nvidia-1070 auto (defined)
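A quick way to check such a listing after a reboot is to count the devices per profile. A small sketch, assuming the column layout shown above (UUID, parent, type, start mode). The function name is my own.

```shell
#!/bin/sh
# Sketch: read "mdevctl list" output on stdin and report how many devices of
# the given type exist and how many of them are set to "auto".
count_mdevs() {
    awk -v t="$1" '
        $3 == t { n++; if ($4 == "auto") a++ }
        END { printf "%d total, %d auto\n", n, a }
    '
}
```

Usage would be "mdevctl list | count_mdevs nvidia-1070". For the listing above this should report 8 total and 8 auto.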