Public bug reported:

Description
===========
When rebuilding an instance that has a GPU attached, it may get additional GPUs if free ones are available. The number can vary between rebuilds; in most rebuilds the instance receives the same number of GPUs as it had before that rebuild.

Steps to reproduce
==================
$ openstack flavor show 5fd13401-7daa-464d-acf1-432d29a3dd92
+----------------------------+-----------------------------------------------+
| Field                      | Value                                         |
+----------------------------+-----------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                         |
| OS-FLV-EXT-DATA:ephemeral  | 0                                             |
| access_project_ids         | None                                          |
| disk                       | 80                                            |
| id                         | 5fd13401-7daa-464d-acf1-432d29a3dd92          |
| name                       | gpu.2.1gpu                                    |
| os-flavor-access:is_public | True                                          |
| properties                 | gpu_m10='true', pci_passthrough:alias='M10:1' |
| ram                        | 5000                                          |
| rxtx_factor                | 1.0                                           |
| swap                       |                                               |
| vcpus                      | 2                                             |
+----------------------------+-----------------------------------------------+

$ openstack server create my-gpu-instace --image CentOS-7 --network my-project-network --flavor 5fd13401-7daa-464d-acf1-432d29a3dd92 --key-name my-key --security-group default

On the gpu node:
[root@g1 ~]# virsh dumpxml instance-0001b22e |grep vfio
<driver name='vfio'/>

$ openstack server rebuild 29d5a9ba-0829-4e33-9d1c-4ee66b55a940

On the gpu node:
[root@g1 ~]# virsh dumpxml instance-0001b22e |grep vfio
<driver name='vfio'/>
<driver name='vfio'/>
<driver name='vfio'/>

* The database:
MariaDB [nova]> select * from pci_devices where instance_uuid='29d5a9ba-0829-4e33-9d1c-4ee66b55a940';
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| created_at          | updated_at          | deleted_at | deleted | id | compute_node_id | address      | product_id | vendor_id | dev_type | dev_id           | label           | status    | extra_info | instance_uuid                        | request_id | numa_node | parent_addr |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| 2018-06-27 10:54:44 | 2018-07-04 12:12:57 | NULL       |       0 |  6 |              36 | 0000:3e:00.0 | 13bd       | 10de      | type-PCI | pci_0000_3e_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
| 2018-06-27 10:54:44 | 2018-07-06 11:00:21 | NULL       |       0 |  9 |              36 | 0000:3f:00.0 | 13bd       | 10de      | type-PCI | pci_0000_3f_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
| 2018-06-27 10:54:44 | 2018-07-04 12:16:31 | NULL       |       0 | 12 |              36 | 0000:40:00.0 | 13bd       | 10de      | type-PCI | pci_0000_40_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
3 rows in set (0.01 sec)

* After some additional rebuilds (5-10), there are 4 GPUs in the database but only one is visible from virsh:
MariaDB [nova]> select * from pci_devices where instance_uuid='29d5a9ba-0829-4e33-9d1c-4ee66b55a940';
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| created_at          | updated_at          | deleted_at | deleted | id | compute_node_id | address      | product_id | vendor_id | dev_type | dev_id           | label           | status    | extra_info | instance_uuid                        | request_id | numa_node | parent_addr |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
| 2018-06-27 10:54:44 | 2018-07-04 12:12:57 | NULL       |       0 |  6 |              36 | 0000:3e:00.0 | 13bd       | 10de      | type-PCI | pci_0000_3e_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
| 2018-06-27 10:54:44 | 2018-07-06 11:00:21 | NULL       |       0 |  9 |              36 | 0000:3f:00.0 | 13bd       | 10de      | type-PCI | pci_0000_3f_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
| 2018-06-27 10:54:44 | 2018-07-04 12:16:31 | NULL       |       0 | 12 |              36 | 0000:40:00.0 | 13bd       | 10de      | type-PCI | pci_0000_40_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         0 | NULL        |
| 2018-06-27 10:54:44 | 2018-07-06 12:25:19 | NULL       |       0 | 21 |              36 | 0000:dc:00.0 | 13bd       | 10de      | type-PCI | pci_0000_dc_00_0 | label_10de_13bd | allocated | {}         | 29d5a9ba-0829-4e33-9d1c-4ee66b55a940 | NULL       |         1 | NULL        |
+---------------------+---------------------+------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-----------+------------+--------------------------------------+------------+-----------+-------------+
4 rows in set (0.00 sec)

[root@g1 ~]# virsh dumpxml instance-0001b22e |grep "vfio\|uuid>"
<uuid>29d5a9ba-0829-4e33-9d1c-4ee66b55a940</uuid>
<driver name='vfio'/>

Expected result
===============
The instance is launched with only one GPGPU after every rebuild.

Actual result
=============
The instance gets rebuilt with an unexpected number of GPGPUs, most often the same number as it had before the last rebuild. I have observed 1-3 GPGPUs.

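The mismatch between the flavor and the running guest could be checked automatically with something along these lines. This is only a rough sketch: the helper names are made up, the sample XML is a hypothetical, heavily trimmed stand-in for real `virsh dumpxml` output, and it assumes the alias value is a comma-separated list of `name:count` pairs like the `M10:1` in the flavor above.

```python
import xml.etree.ElementTree as ET


def expected_pci_count(alias_spec):
    """Sum the device counts in an alias spec such as 'M10:1' or 'a:1,b:2'."""
    total = 0
    for item in alias_spec.split(","):
        _name, _sep, count = item.strip().partition(":")
        total += int(count)
    return total


def vfio_hostdev_count(domain_xml):
    """Count <hostdev> elements that have a <driver name='vfio'/> child."""
    root = ET.fromstring(domain_xml.strip())
    return sum(
        1
        for hostdev in root.iter("hostdev")
        if any(drv.get("name") == "vfio" for drv in hostdev.findall("driver"))
    )


# Hypothetical, trimmed domain XML standing in for `virsh dumpxml` output
# after a rebuild that attached three devices instead of one.
SAMPLE_XML = """
<domain type='kvm'>
  <uuid>29d5a9ba-0829-4e33-9d1c-4ee66b55a940</uuid>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'><driver name='vfio'/></hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'><driver name='vfio'/></hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'><driver name='vfio'/></hostdev>
  </devices>
</domain>
"""

if __name__ == "__main__":
    expected = expected_pci_count("M10:1")
    attached = vfio_hostdev_count(SAMPLE_XML)
    if attached != expected:
        print("mismatch: flavor requests", expected, "but guest has", attached)
```

Comparing the same expected count against the number of rows returned by the `pci_devices` query would cover the case where the database and libvirt disagree with each other.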
This has been tested on systems with 3 NVIDIA Tesla V100, 4 NVIDIA Tesla P100, and on a system with two physical NVIDIA M10 cards (the system sees those as 8 GPGPUs, 4 per card).

Environment
===========
[root@g1 ~]# rpm -qa |grep nova
openstack-nova-common-14.1.0-1.el7.noarch
openstack-nova-compute-14.1.0-1.el7.noarch
python2-novaclient-6.0.2-1.el7.noarch
python-nova-14.1.0-1.el7.noarch

[root@g1 ~]# rpm -qa |grep -i 'kvm\|qemu\|libvirt' |grep -v daemon
libvirt-client-3.9.0-14.el7_5.5.x86_64
qemu-kvm-ev-2.10.0-21.el7_5.3.1.x86_64
libvirt-python-3.9.0-1.el7.x86_64
qemu-img-ev-2.10.0-21.el7_5.3.1.x86_64
qemu-kvm-common-ev-2.10.0-21.el7_5.3.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-libs-3.9.0-14.el7_5.5.x86_64

[root@g1 ~]# rbd -v
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

[root@g1 ~]# rpm -qa openstack-neutron*
openstack-neutron-common-9.4.1-1.el7.noarch
openstack-neutron-9.4.1-1.el7.noarch
openstack-neutron-linuxbridge-9.4.1-1.el7.noarch
openstack-neutron-ml2-9.4.1-1.el7.noarch

Logs & Configs
==============
I don't know which config or log files would be most useful, and I won't put a dump online, but I can grep for specific things if necessary.

[root@devel1 ~]# grep ^pci_alias /etc/nova/nova.conf
pci_alias={"vendor_id":"10de","product_id":"13bd","name":"M10"}

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1780441

Title:
  Rebuild does not respect number of PCIe devices

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1780441/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp