chandrakantrai opened a new issue, #11574:
URL: https://github.com/apache/cloudstack/issues/11574

   ### problem
   
   When starting a VM with GPU allocation, the dashboard shows the GPU devices as Allocated for a few seconds, but the StatsCollector then resets them to the Free state.
   This happens even though the VM is still running.
   
   **As a result:**
   
   - The GPU lifecycle and inventory tracking become inconsistent
   - New GPU VM creation fails because CloudStack attempts to assign the same 
GPU that was previously allocated to another running VM
   
   **Environment**
   
   - CloudStack version: 4.21.0
   - OS / Distro: Ubuntu 24.04.3 LTS
   - Hypervisor: KVM
   - GPU hardware: NVIDIA H100
   - VM ID: 54
   
   
   **Steps to Reproduce**
   
   - Create a GPU-enabled service offering (2 GPUs in this case).
   - Deploy a VM using this service offering.
   - Observe GPU devices are marked as Allocated initially.
   - Within a few seconds, StatsCollector resets them to Free in the UI / DB.
   - Try to deploy another GPU VM → fails because CloudStack attempts to 
reallocate the same GPU device.
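
The sequence above can be modelled in a few lines. This is a toy sketch, not CloudStack code: `GpuDevice`, `allocate`, `faulty_reset`, and VM ID 55 are hypothetical names invented for illustration. Only device IDs 23 and 25 and VM ID 54 come from this report. The model shows the inventory side of the problem (the same records being handed out twice); it does not model the failure on the host.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuDevice:
    id: int
    state: str = "Free"
    vm_id: Optional[int] = None


def allocate(devices, vm_id, count):
    """Hypothetical allocator: hand out the first `count` Free devices."""
    free = [d for d in devices if d.state == "Free"]
    if len(free) < count:
        raise RuntimeError("not enough free GPU devices")
    chosen = free[:count]
    for d in chosen:
        d.state, d.vm_id = "Allocated", vm_id
    return [d.id for d in chosen]


def faulty_reset(devices, visible_vm_ids):
    """Hypothetical collector pass: free any device whose VM it cannot find."""
    for d in devices:
        if d.state == "Allocated" and d.vm_id not in visible_vm_ids:
            d.state, d.vm_id = "Free", None


devices = [GpuDevice(23), GpuDevice(25)]
first = allocate(devices, vm_id=54, count=2)
# Assumption: the collector fails to see VM 54 although it is running.
faulty_reset(devices, visible_vm_ids=set())
# VM 55 is a made-up second deployment; it receives the same device records.
second = allocate(devices, vm_id=55, count=2)
double_booked = sorted(set(first) & set(second))
```

After the run, `double_booked` lists the devices that were handed to both VMs, which is the inconsistency described under "As a result".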
   
   **Database output:**
   
   mysql> select * from gpu_device;
   
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
   | id | uuid                                 | card_id | vgpu_profile_id | bus_address | type | host_id | vm_id | numa_node | pci_root     | parent_gpu_device_id | state     | managed_state |
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
   | 23 | a998e66c-3147-4018-bf4e-45c34445c847 |       1 |               1 | 04:00.0     | PCI  |       4 |    54 | 0         | 0000:04:00.0 |                 NULL | Allocated | Managed       |
   | 25 | 60b63179-0500-4f77-8a51-b837554f802b |       1 |               1 | 43:00.0     | PCI  |       4 |    54 | 0         | 0000:43:00.0 |                 NULL | Allocated | Managed       |
   
   ... output omitted for brevity
   
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-----------+---------------+
   8 rows in set (0.00 sec)
   
   
   **After a few seconds, all the assigned GPUs are Free in the UI and database.**
   
   mysql> select * from gpu_device;
   
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
   | id | uuid                                 | card_id | vgpu_profile_id | bus_address | type | host_id | vm_id | numa_node | pci_root     | parent_gpu_device_id | state | managed_state |
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
   | 23 | a998e66c-3147-4018-bf4e-45c34445c847 |       1 |               1 | 04:00.0     | PCI  |       4 |  NULL | 0         | 0000:04:00.0 |                 NULL | Free  | Managed       |
   | 25 | 60b63179-0500-4f77-8a51-b837554f802b |       1 |               1 | 43:00.0     | PCI  |       4 |  NULL | 0         | 0000:43:00.0 |                 NULL | Free  | Managed       |
   
   +----+--------------------------------------+---------+-----------------+-------------+------+---------+-------+-----------+--------------+----------------------+-------+---------------+
   8 rows in set (0.00 sec)
   
   
   **CloudStack manager log:**
   2025-09-04 12:34:48,335 INFO  [o.a.c.f.j.i.AsyncJobMonitor] 
(API-Job-Executor-7:[ctx-4efa690c, job-2764]) (logid:21ffde37) Add job-2764 
into job monitoring
   2025-09-04 12:34:48,387 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4416:[ctx-cdb8cbf8, ctx-51e20bb4]) (logid:54a46693) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:48,508 INFO  [c.c.a.m.a.i.FirstFitRoutingAllocator] 
(API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8, 
FirstFitRoutingAllocator]) (logid:df61abd2)  Guest VM is requested with 
Custom[UEFI] Boot Type false
   2025-09-04 12:34:48,550 INFO  [c.c.d.DeploymentPlanningManagerImpl] 
(API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8]) (logid:df61abd2) 
Re-ordering hosts [Host 
{"id":4,"name":"innmi1csh1-p002","type":"Routing","uuid":"3f67cdbc-cf17-4190-bd77-c6fb8ff41ecd"}]
 by priorities {}
   2025-09-04 12:34:48,556 INFO  [c.c.d.DeploymentPlanningManagerImpl] 
(API-Job-Executor-7:[ctx-4efa690c, job-2764, ctx-3f6350d8]) (logid:df61abd2) 
Hosts after re-ordering are: [Host 
{"id":4,"name":"innmi1csh1-p002","type":"Routing","uuid":"3f67cdbc-cf17-4190-bd77-c6fb8ff41ecd"}]
   2025-09-04 12:34:49,335 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-702df207]) (logid:59cd45c1) No inactive management 
server node found
   2025-09-04 12:34:49,855 INFO  [o.a.c.f.j.i.AsyncJobMonitor] 
(Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765]) (logid:2ee18b91) Add 
job-2765 into job monitoring
   2025-09-04 12:34:50,191 INFO  [c.c.n.e.VpcVirtualRouterElement] 
(Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765, ctx-05034172]) 
(logid:df61abd2) Adding VPC routers to Guest Network: 1 to be added!
   2025-09-04 12:34:50,829 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-603277ba]) (logid:d89f668b) No inactive management 
server node found
   2025-09-04 12:34:51,170 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4455:[ctx-190018db, ctx-0dae51cf]) (logid:7fe357da) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:51,481 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-2154:[ctx-9019da22, ctx-d82cc3c0]) (logid:a310b4d8) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:51,813 INFO  [o.a.c.g.GpuServiceImpl] 
(Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765, ctx-05034172]) 
(logid:df61abd2) Allocated 2 GPU devices using single NUMA node strategy
   2025-09-04 12:34:52,333 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-12b50656]) (logid:30f02e07) No inactive management 
server node found
   2025-09-04 12:34:53,828 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-643657a7]) (logid:b608bd4b) No inactive management 
server node found
   2025-09-04 12:34:54,674 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4416:[ctx-b6f0da0e, ctx-0f091485]) (logid:616fce75) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:55,354 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-0e046f30]) (logid:1d29c60a) No inactive management 
server node found
   2025-09-04 12:34:55,949 INFO  [o.a.c.f.j.i.AsyncJobMonitor] 
(Work-Job-Executor-7:[ctx-e3263ad5, job-2764/job-2765]) (logid:df61abd2) Remove 
job-2765 from job monitoring
   2025-09-04 12:34:56,114 INFO  [o.a.c.f.j.i.AsyncJobMonitor] 
(API-Job-Executor-7:[ctx-4efa690c, job-2764]) (logid:df61abd2) Remove job-2764 
from job monitoring
   2025-09-04 12:34:56,260 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4455:[ctx-440ad258, ctx-a3c60023]) (logid:0ed16d6e) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:56,835 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-cbc66d8e]) (logid:0f7bde7d) No inactive management 
server node found
   2025-09-04 12:34:57,837 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-2154:[ctx-75b28a2e, ctx-a21ffd1b]) (logid:b2b9e83e) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:58,087 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4416:[ctx-aff94480, ctx-b8607cb3]) (logid:c2e9a913) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:34:58,330 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-3c3d3284]) (logid:17037509) No inactive management 
server node found
   2025-09-04 12:34:58,987 WARN  [c.c.u.s.Script] 
(StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) Execution of process 
[162515] for command [/bin/bash -c systemctl status cloudstack-usage | grep "  
Loaded:" ] failed.
   2025-09-04 12:34:58,987 WARN  [c.c.u.s.Script] 
(StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) Process [162515] for command 
[/bin/bash -c systemctl status cloudstack-usage | grep "  Loaded:" ] 
encountered the error: [Unit cloudstack-usage.service could not be found.].
   2025-09-04 12:34:59,020 INFO  [c.c.s.S.ManagementServerCollector] 
(StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) system memory from /proc: 
16371363840
   2025-09-04 12:34:59,031 INFO  [c.c.s.S.ManagementServerCollector] 
(StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) free memory from /proc: 
11027075072
   2025-09-04 12:34:59,076 INFO  [c.c.s.S.ManagementServerCollector] 
(StatsCollector-6:[ctx-c8ab652c]) (logid:1ab365c9) used memory from /proc: 
1446256
   2025-09-04 12:34:59,243 INFO  [c.c.v.d.VmStatsDaoImpl] 
(StatsCollector-2:[ctx-cb2988fc]) (logid:bbfaa994) Removed a total of [3] 
vm_stats rows older than [Thu Sep 04 00:34:59 IST 2025].
   2025-09-04 12:34:59,833 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-b506ddae]) (logid:53a19df8) No inactive management 
server node found
   2025-09-04 12:35:01,332 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-78eeb1b3]) (logid:3bae1866) No inactive management 
server node found
   2025-09-04 12:35:02,830 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-9e1bba38]) (logid:8700e919) No inactive management 
server node found
   2025-09-04 12:35:03,189 INFO  [o.a.c.v.s.VMSchedulerImpl] 
(VMSchedulerPollTask:[ctx-aacd3167]) (logid:f674a6bf) Cleaned up 0 VM scheduled 
job entries
   2025-09-04 12:35:03,343 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-4455:[ctx-801f00d3, ctx-397e56a1]) (logid:57231ce1) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:35:04,347 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-d836e142]) (logid:2766c6d5) No inactive management 
server node found
   2025-09-04 12:35:05,834 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-6c3a521a]) (logid:c8463fc5) No inactive management 
server node found
   2025-09-04 12:35:07,332 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-54f84601]) (logid:1d275b17) No inactive management 
server node found
   2025-09-04 12:35:08,841 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-89fe221f]) (logid:1f8850f9) No inactive management 
server node found
   2025-09-04 12:35:10,332 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-0c25c8ff]) (logid:cb689ff8) No inactive management 
server node found
   2025-09-04 12:35:11,833 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-7267c603]) (logid:bb6187c9) No inactive management 
server node found
   
   
   2025-09-04 12:35:13,102 INFO  [o.a.c.a.DynamicRoleBasedAPIAccessChecker] 
(qtp1404565079-2154:[ctx-b19f2b1d, ctx-ebc7afdd]) (logid:39fdcc01) Account for 
user id 73618f4e-87c6-11f0-91aa-525400abd4cf is Root Admin or Domain Admin, all 
APIs are allowed.
   2025-09-04 12:35:13,331 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-6fb48541]) (logid:f8cdbf10) No inactive management 
server node found
   2025-09-04 12:35:13,376 WARN  [o.a.c.g.GpuServiceImpl] (StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found for GPU device GpuDevice {"busAddress":"04:00.0","cardId":1,"hostId":4,"id":23,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:04:00.0","state":"Allocated","uuid":"a998e66c-3147-4018-bf4e-45c34445c847","vgpuProfileId":1,"vmId":54}. Allocated to a removed VM. Setting state to Free.
   2025-09-04 12:35:13,432 WARN  [o.a.c.g.GpuServiceImpl] (StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found for GPU device GpuDevice {"busAddress":"43:00.0","cardId":1,"hostId":4,"id":25,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:43:00.0","state":"Allocated","uuid":"60b63179-0500-4f77-8a51-b837554f802b","vgpuProfileId":1,"vmId":54}. Allocated to a removed VM. Setting state to Free.
   2025-09-04 12:35:14,851 INFO  [c.c.c.ClusterManagerImpl] 
(Cluster-Heartbeat-1:[ctx-0181189b]) (logid:15900e7b) No inactive management 
server node found
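
The relevant entries are the two `GpuServiceImpl` warnings at 12:35:13; everything else in the excerpt is routine heartbeat and API-access logging. To pull such entries out of a longer log, a small filter along these lines may help. It is a sketch that assumes the message wording shown above stays stable; `find_gpu_resets` and `SAMPLE` are hypothetical names, and the sample text is copied from the first warning in this report, including its mail-wrapped line breaks.

```python
import json
import re

# Matches the warning text exactly as it appears in this report.
PATTERN = re.compile(
    r"VM with ID (\d+) not found for GPU device GpuDevice (\{.*?\})\. "
    r"Allocated to a removed VM\. Setting state to Free\."
)


def find_gpu_resets(log_text):
    """Return (vm_id, gpu_device_id, bus_address) for each reset warning."""
    # Collapse wrapped line breaks into single spaces before matching.
    flat = " ".join(log_text.split())
    resets = []
    for vm_id, payload in PATTERN.findall(flat):
        device = json.loads(payload)  # the device dump is valid JSON
        resets.append((int(vm_id), device["id"], device["busAddress"]))
    return resets


SAMPLE = """
2025-09-04 12:35:13,376 WARN  [o.a.c.g.GpuServiceImpl]
(StatsCollector-5:[ctx-3ae1d26a]) (logid:0d5dbfa5) VM with ID 54 not found for
GPU device GpuDevice
{"busAddress":"04:00.0","cardId":1,"hostId":4,"id":23,"numaNode":"0","parentGpuDeviceId":null,"pciRoot":"0000:04:00.0","state":"Allocated","uuid":"a998e66c-3147-4018-bf4e-45c34445c847","vgpuProfileId":1,"vmId":54}.
 Allocated to a removed VM. Setting state to Free.
"""

resets = find_gpu_resets(SAMPLE)
```

Each tuple names the VM the collector could not find and the device it freed, which can then be compared against `vm_instance`.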
   
   **VM status in DB:**
   
   mysql> SELECT * FROM vm_instance WHERE id = 54;
   
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
   | id | name          | uuid                                 | instance_name 
| state   | vm_template_id | guest_os_id | private_mac_address | 
private_ip_address | pod_id | data_center_id | host_id | last_host_id | 
proxy_id | proxy_assign_time   | vnc_password                                   
                      | ha_enabled | limit_cpu_use | update_count | update_time 
        | created             | removed | type | vm_type | account_id | 
domain_id | service_offering_id | reservation_id                       | 
hypervisor_type | owner | host_name     | display_name  | desired_state | 
dynamically_scalable | display_vm | power_state | power_state_update_time | 
power_state_update_count | power_host | user_id | backup_offering_id | 
backup_external_id | backup_volumes | delete_protection |
   
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
   | 54 | vmwith2gpu-01 | 22c80f57-84ad-415a-9a61-f027aabcc4b1 | i-2-54-VM     
| Running |            204 |         381 | 02:01:00:cc:00:10   | 10.1.1.83      
    |      1 |              1 |       4 |            4 |       38 | 2025-09-03 
07:43:26 | b/Aqozg0vfODSYaDiU9P9emMyZzEeMx7Y/8q8I5fWvm+bnzMDX86LOQo/m/holKKv3Y= 
|          0 |             0 |           23 | 2025-09-04 07:04:55 | 2025-09-03 
07:42:15 | NULL    | User | User    |          2 |         1 |                  
14 | fe287204-fa5b-44a3-8542-5c4d10b3f24d | KVM             | 2     | 
vmwith2gpu-01 | vmwith2gpu-01 | NULL          |                    0 |          
1 | PowerOn     | 2025-09-04 12:01:09     |                        2 |          
4 |       2 |               NULL | NULL               | NULL           |        
         0 |
   
+----+---------------+--------------------------------------+---------------+---------+----------------+-------------+---------------------+--------------------+--------+----------------+---------+--------------+----------+---------------------+----------------------------------------------------------------------+------------+---------------+--------------+---------------------+---------------------+---------+------+---------+------------+-----------+---------------------+--------------------------------------+-----------------+-------+---------------+---------------+---------------+----------------------+------------+-------------+-------------------------+--------------------------+------------+---------+--------------------+--------------------+----------------+-------------------+
   1 row in set (0.00 sec)
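
The row above shows VM 54 with `state` = Running and `removed` = NULL, which contradicts the "Allocated to a removed VM" warning. One way to cross-check the two tables is a join that lists Allocated devices whose VM row still exists and is not removed. The sketch below uses Python's `sqlite3` as a stand-in for MySQL and a schema trimmed to three columns per table, both of which are assumptions made for illustration; the values are the ones from the first `gpu_device` output and the `vm_instance` row in this report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, hypothetical schema: only the columns needed for the check.
conn.executescript("""
CREATE TABLE vm_instance (id INTEGER PRIMARY KEY, state TEXT, removed TEXT);
CREATE TABLE gpu_device (id INTEGER PRIMARY KEY, vm_id INTEGER, state TEXT);
INSERT INTO vm_instance VALUES (54, 'Running', NULL);
INSERT INTO gpu_device VALUES (23, 54, 'Allocated'), (25, 54, 'Allocated');
""")

# Allocated devices whose owning VM is present and not marked removed.
# Freeing any of these would be inconsistent with vm_instance.
QUERY = """
SELECT g.id, g.vm_id, v.state
FROM gpu_device AS g
JOIN vm_instance AS v ON v.id = g.vm_id
WHERE g.state = 'Allocated' AND v.removed IS NULL
ORDER BY g.id
"""

must_stay_allocated = conn.execute(QUERY).fetchall()
```

With the data from this report, both devices 23 and 25 appear in the result, so neither should have been reset to Free.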
   
   
   
   

