rajujith opened a new issue, #10477:
URL: https://github.com/apache/cloudstack/issues/10477

   ### problem
   
   With the default configuration, CloudStack takes 15-20 minutes to determine 
that a KVM host is down. HA-enabled instances are started on another host only 
after this determination. While reviewing the delay in the host state 
investigation that follows a ping timeout, I see one command, 
**com.cloud.agent.api.CheckOnHostCommand**, that takes about 10 minutes and 
prints the message 'timed out after 3600' in the logs. After that, the host is 
quickly determined to be down via the neighbouring host. 
   
   I suspect there is an issue in this specific implementation; if it is 
fixed, the VM HA delay on KVM could be reduced by about 10 minutes. 
   
   ```
   2025-01-28 06:22:30,041 DEBUG [c.c.a.t.Request] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: 
Sending  { Cmd , MgmtId: 32988184186020, via: 
2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}]
 }
   2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: 
Waiting some more time because this is the current command
   2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: 
Waiting some more time because this is the current command
   2025-01-28 06:32:14,792 WARN  [c.c.a.m.AgentAttache] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: 
Timed out on Seq 2-4979573812988215360:  { Cmd , MgmtId: 32988184186020, via: 
2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}]
 }
   2025-01-28 06:32:14,793 DEBUG [c.c.a.m.AgentAttache] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: 
Cancelling.
   2025-01-28 06:32:14,793 WARN  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Operation timed out: Commands 
4979573812988215360 to Host 2 timed out after 3600
   ```
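
To put a number on the wait in the excerpt above, here is a minimal sketch that computes the time between the `Sending` line and the `Timed out` line. The two timestamps are copied from the log; the helper name `elapsed_seconds` and the format constant are illustrative only and not part of CloudStack.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
SENT = "2025-01-28 06:22:30,041"        # "Sending { Cmd ... CheckOnHostCommand"
TIMED_OUT = "2025-01-28 06:32:14,792"   # "Timed out on Seq ..."

# Assumed layout of the management-server log timestamp (date, time, milliseconds).
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


wait = elapsed_seconds(SENT, TIMED_OUT)
print(f"CheckOnHostCommand waited {wait:.3f} s ({wait / 60:.1f} min)")
```

For this excerpt the wait comes out at roughly 585 seconds, just under 10 minutes. The log line does not state the unit of the reported value 3600.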
   
   
https://gist.github.com/rajujith/9a51c52163eb4862b497057a40e8b812#file-acs-kvm-vm-ha-host-down
   
   ### versions
   
   4.19.1.3
   
   ### The steps to reproduce the bug
   
   1. Power off the host via iLO/iDRAC, or power off the nested hypervisor 
through the base hypervisor.
   2. Observe the delay in VM HA and review the logs.
   
   ...
   
   
   ### What to do about it?
   
   Reduce the delay in the VM HA on KVM. 

