chandrakantrai opened a new issue, #11574: URL: https://github.com/apache/cloudstack/issues/11574
### problem

When starting a VM with GPU allocation, the dashboard shows the GPU devices as Allocated for a few seconds, but the StatsCollector then resets them to the Free state. This happens even though the VM is still running.

**As a result:**
- The GPU lifecycle and inventory tracking become inconsistent.
- New GPU VM creation fails because CloudStack attempts to assign a GPU that is already allocated to another running VM.

**Environment**
- CloudStack version: 4.21.0
- OS / Distro: Ubuntu 24.04.3 LTS
- Hypervisor: KVM
- GPU hardware: NVIDIA H100
- VM ID: 54

**Steps to Reproduce**
- Create a GPU-enabled service offering (2 GPUs, in this case).
- Deploy a VM using this service offering.
- Observe that the GPU devices are initially marked as Allocated.
- Within a few seconds, StatsCollector resets them to Free in the UI / DB.
- Try to deploy another GPU VM → fails because CloudStack attempts to reallocate the same GPU device.

**Database output:**

mysql> select * from gpu_device;
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
| id | uuid                                 | card_id | vgpu_profile_id | bus_address | type | host_id | vm_id | numa_node | pci_root     | parent_gpu_device_id | state     | managed_state |
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
| 23 | a998e66c-3147-4018-bf4e-45c34445c847 |       1 |               1 | 04:00.0     | PCI  |       4 |    54 | 0         | 0000:04:00.0 |                 NULL | Allocated | Managed       |
| 25 | 60b63179-0500-4f77-8a51-b837554f802b |       1 |               1 | 43:00.0     | PCI  |       4 |    54 | 0         | 0000:43:00.0 |                 NULL | Allocated | Managed       |
... output omitted for brevity
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
8 rows in set (0.00 sec)

**After a few seconds, all the assigned GPUs are shown as Free in the UI and the database:**

mysql> select * from gpu_device;
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
| id | uuid                                 | card_id | vgpu_profile_id | bus_address | type | host_id | vm_id | numa_node | pci_root     | parent_gpu_device_id | state | managed_state |
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
| 23 | a998e66c-3147-4018-bf4e-45c34445c847 |       1 |               1 | 04:00.0     | PCI  |       4 |  NULL | 0         | 0000:04:00.0 |                 NULL | Free  | Managed       |
| 25 | 60b63179-0500-4f77-8a51-b837554f802b |       1 |               1 | 43:00.0     | PCI  |       4 |  NULL | 0         | 0000:43:00.0 |                 NULL | Free  | Managed       |
+----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
8 rows in set (0.00 sec)

**Cloudstack manager log:**

2025-09-04 12:34:48,335 INFO [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-7:[ctx-4efa690c, job-2764]) (logid:21ffde37) Add job-2764 into job monitoring
2025-09-04 12:34:48,387 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4416:[ctx-cdb8cbf8, ctx-51e20bb4]) (logid:54a46693) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:48,508 INFO [c.c.a.m.a.i.FirstFitRoutingAllocator] (API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8, FirstFitRoutingAllocator]) (logid:df61abd2) Guest VM is requested with Custom[UEFI] Boot Type false
2025-09-04 12:34:48,550 INFO [c.c.d.DeploymentPlanningManagerImpl] (API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8]) (logid:df61abd2) Re-ordering hosts [Host {"id":4,"name":"innmi1csh1-p002","type":"Routing","uuid":"3f67cdbc-cf17-4190-bd77-c6fb8ff41ecd"}] by priorities {}
2025-09-04 12:34:48,556 INFO [c.c.d.DeploymentPlanningManagerImpl] (API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8]) (logid:df61abd2) Hosts after re-ordering are: [Host {"id":4,"name":"innmi1csh1-p002","type":"Routing","uuid":"3f67cdbc-cf17-4190-bd77-c6fb8ff41ecd"}]
2025-09-04 12:34:49,335 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-702df207]) (logid:59cd45c1) No inactive management server node found
2025-09-04 12:34:49,855 INFO [o.a.c.f.j.i.AsyncJobMonitor] (Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765]) (logid:2ee18b91) Add job-2765 into job monitoring
2025-09-04 12:34:50,191 INFO [c.c.n.e.VpcVirtualRouterElement] (Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765, ctx-05034172]) (logid:df61abd2) Adding VPC routers to Guest Network: 1 to be added!
2025-09-04 12:34:50,829 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-603277ba]) (logid:d89f668b) No inactive management server node found
2025-09-04 12:34:51,170 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4455:[ctx-190018db, ctx-0dae51cf]) (logid:7fe357da) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:51,481 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-2154:[ctx-9019da22, ctx-d82cc3c0]) (logid:a310b4d8) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:51,813 INFO [o.a.c.g.GpuServiceImpl] (Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765, ctx-05034172]) (logid:df61abd2) Allocated 2 GPU devices using single NUMA node strategy
2025-09-04 12:34:52,333 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-12b50656]) (logid:30f02e07) No inactive management server node found
2025-09-04 12:34:53,828 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-643657a7]) (logid:b608bd4b) No inactive management server node found
2025-09-04 12:34:54,674 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4416:[ctx-b6f0da0e, ctx-0f091485]) (logid:616fce75) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:55,354 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-0e046f30]) (logid:1d29c60a) No inactive management server node found
2025-09-04 12:34:55,949 INFO [o.a.c.f.j.i.AsyncJobMonitor] (Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765]) (logid:df61abd2) Remove job-2765 from job monitoring
2025-09-04 12:34:56,114 INFO [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-7:[ctx-4efa690c, job-2764]) (logid:df61abd2) Remove job-2764 from job monitoring
2025-09-04 12:34:56,260 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4455:[ctx-440ad258, ctx-a3c60023]) (logid:0ed16d6e) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:56,835 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-cbc66d8e]) (logid:0f7bde7d) No inactive management server node found
2025-09-04 12:34:57,837 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-2154:[ctx-75b28a2e, ctx-a21ffd1b]) (logid:b2b9e83e) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:58,087 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4416:[ctx-aff94480, ctx-b8607cb3]) (logid:c2e9a913) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:34:58,330 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-3c3d3284]) (logid:17037509) No inactive management server node found
2025-09-04 12:34:58,987 WARN [c.c.u.s.Script] (StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) Execution of process [162515] for command [/bin/bash -c systemctl status cloudstack-usage | grep " Loaded:" ] failed.
2025-09-04 12:34:58,987 WARN [c.c.u.s.Script] (StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) Process [162515] for command [/bin/bash -c systemctl status cloudstack-usage | grep " Loaded:" ] encountered the error: [Unit cloudstack-usage.service could not be found.].
2025-09-04 12:34:59,020 INFO [c.c.s.S.ManagementServerCollector] (StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) system memory from /proc: 16371363840
2025-09-04 12:34:59,031 INFO [c.c.s.S.ManagementServerCollector] (StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) free memory from /proc: 11027075072
2025-09-04 12:34:59,076 INFO [c.c.s.S.ManagementServerCollector] (StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) used memory from /proc: 1446256
2025-09-04 12:34:59,243 INFO [c.c.v.d.VmStatsDaoImpl] (StatsCollector-2:[ctx-cb2988fc]) (logid:bbfaa994) Removed a total of [3] vm_stats rows older than [Thu Sep 04 00:34:59 IST 2025].
2025-09-04 12:34:59,833 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-b506ddae]) (logid:53a19df8) No inactive management server node found
2025-09-04 12:35:01,332 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-78eeb1b3]) (logid:3bae1866) No inactive management server node found
2025-09-04 12:35:02,830 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-9e1bba38]) (logid:8700e919) No inactive management server node found
2025-09-04 12:35:03,189 INFO [o.a.c.v.s.VMSchedulerImpl] (VMSchedulerPollTask:[ctx-aacd3167]) (logid:f674a6bf) Cleaned up 0 VM scheduled job entries
2025-09-04 12:35:03,343 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-4455:[ctx-801f00d3, ctx-397e56a1]) (logid:57231ce1) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:35:04,347 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-d836e142]) (logid:2766c6d5) No inactive management server node found
2025-09-04 12:35:05,834 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-6c3a521a]) (logid:c8463fc5) No inactive management server node found
2025-09-04 12:35:07,332 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-54f84601]) (logid:1d275b17) No inactive management server node found
2025-09-04 12:35:08,841 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-89fe221f]) (logid:1f8850f9) No inactive management server node found
2025-09-04 12:35:10,332 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-0c25c8ff]) (logid:cb689ff8) No inactive management server node found
2025-09-04 12:35:11,833 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-7267c603]) (logid:bb6187c9) No inactive management server node found
2025-09-04 12:35:13,102 INFO [o.a.c.a.DynamicRoleBasedAPIAccessChecker] (qtp1404565079-2154:[ctx-b19f2b1d, ctx-ebc7afdd]) (logid:39fdcc01) Account for user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all APIs are allowed.
2025-09-04 12:35:13,331 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-6fb48541]) (logid:f8cdbf10) No inactive management server node found
2025-09-04 12:35:13,376 WARN [o.a.c.g.GpuServiceImpl] (StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found for GPU device GpuDevice {"busAddress":"04:00.0","cardId":1,"hostId":4,"id":23,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:04:00.0","state":"Allocated","uuid":"a998e66c-3147-4018-bf4e-45c34445c847","vgpuProfileId":1,"vmId":54}. Allocated to a removed VM. Setting state to Free.
2025-09-04 12:35:13,432 WARN [o.a.c.g.GpuServiceImpl] (StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found for GPU device GpuDevice {"busAddress":"43:00.0","cardId":1,"hostId":4,"id":25,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:43:00.0","state":"Allocated","uuid":"60b63179-0500-4f77-8a51-b837554f802b","vgpuProfileId":1,"vmId":54}. Allocated to a removed VM. Setting state to Free.
2025-09-04 12:35:14,851 INFO [c.c.c.ClusterManagerImpl] (Cluster-Heartbeat-1:[ctx-0181189b]) (logid:15900e7b) No inactive management server node found

**VM status in DB:**

mysql> SELECT * FROM vm_instance WHERE id = 54;
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
| id | name          | uuid                                 | instance_name | state   | vm_template_id | guest_os_id | private_mac_address | private_ip_address | pod_id | data_center_id | host_id | last_host_id | proxy_id | proxy_assign_time   | vnc_password                                                         | ha_enabled | limit_cpu_use | update_count | update_time         | created             | removed | type | vm_type | account_id | domain_id | service_offering_id | reservation_id                       | hypervisor_type | owner | host_name     | display_name  | desired_state | dynamically_scalable | display_vm | power_state | power_state_update_time | power_state_update_count | power_host | user_id | backup_offering_id | backup_external_id | backup_volumes | delete_protection |
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
| 54 | vmwith2gpu-01 | 22c80f57-84ad-415a-9a61-f027aabcc4b1 | i-2-54-VM     | Running |            204 |         381 | 02:01:00:cc:00:10   | 10.1.1.83          |      1 |              1 |       4 |            4 |       38 | 2025-09-03 07:43:26 | b/Aqozg0vfODSYaDiU9P9emMyZzEeMx7Y/8q8I5fWvm+bnzMDX86LOQo/m/holKKv3Y= |          0 |             0 |           23 | 2025-09-04 07:04:55 | 2025-09-03 07:42:15 | NULL    | User | User    |          2 |         1 |                  14 | fe287204-fa5b-44a3-8542-5c4d10b3f24d | KVM             | 2     | vmwith2gpu-01 | vmwith2gpu-01 | NULL          |                    0 |          1 | PowerOn     | 2025-09-04 12:01:09     |                        2 |          4 |       2 |               NULL | NULL               | NULL           |                 0 |
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
1 row in set (0.00 sec)

### versions

The versions of ACS, hypervisors, storage, network etc.

### The steps to reproduce the bug

1.
2.
3.
...

### What to do about it?

_No response_
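
For anyone trying to confirm the same behaviour on their own setup, the `GpuServiceImpl` warnings in the manager log above can be pulled out with a short script and compared against `gpu_device` and `vm_instance`. This is only a minimal sketch: the regular expression is inferred from the two warning lines shown in this report, not from the CloudStack source, and `find_gpu_resets` and `SAMPLE` are hypothetical names used for illustration.

```python
import json
import re

# Pattern inferred from the two GpuServiceImpl warnings in the log above,
# not from the CloudStack source; adjust it if the message text differs.
RESET_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+WARN\s+"
    r"\[o\.a\.c\.g\.GpuServiceImpl\].*?"
    r"VM with ID (?P<vm_id>\d+) not found for GPU device GpuDevice "
    r"(?P<device>\{.*?\})\. Allocated to a removed VM"
)


def find_gpu_resets(lines):
    """Return one dict per reset warning: timestamp, VM id, device id, bus address."""
    resets = []
    for line in lines:
        match = RESET_RE.search(line)
        if match is None:
            continue
        # The device description in the warning is a flat JSON object.
        device = json.loads(match.group("device"))
        resets.append({
            "timestamp": match.group("ts"),
            "vm_id": int(match.group("vm_id")),
            "device_id": device["id"],
            "bus_address": device["busAddress"],
        })
    return resets


# One of the warning lines from this report, used as sample input.
SAMPLE = (
    '2025-09-04 12:35:13,376 WARN [o.a.c.g.GpuServiceImpl] '
    '(StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found '
    'for GPU device GpuDevice {"busAddress":"04:00.0","cardId":1,"hostId":4,'
    '"id":23,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:04:00.0",'
    '"state":"Allocated","uuid":"a998e66c-3147-4018-bf4e-45c34445c847",'
    '"vgpuProfileId":1,"vmId":54}. Allocated to a removed VM. Setting state to Free.'
)

print(find_gpu_resets([SAMPLE]))
```

Passing an open handle to the management server log instead of `[SAMPLE]` lists every reset. Each reported `vm_id` can then be looked up in `vm_instance`; in this report the row for VM 54 has `state = Running` and `removed = NULL`, which contradicts the "Allocated to a removed VM" message.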