somejfn commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available
URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-432374446

Confirmed: we see similar behavior on 4.11.2rc3, and the agent went into the Down state.

Agent logs:

810986-e702-36ea-a87b-fd48064ecb12
2018-10-23 13:14:40,391 INFO [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) No existing libvirtd connection found. Opening a new one
2018-10-23 13:14:40,392 WARN [kvm.resource.LibvirtConnection] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Can not find a connection for Instance i-4-24-VM. Assuming the default connection.
2018-10-23 13:14:40,399 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) (logid:f8cd7cf7) Trying to fetch storage pool 4e49054a-463f-306f-9678-b0d9b02af9a1 from libvirt
2018-10-23 13:14:51,496 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:14:51,498 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-2:null) (logid:3a0df8e5) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,027 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Trying to fetch storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c from libvirt
2018-10-23 13:15:25,029 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:581a1d95) Asking libvirt to refresh storage pool 0e233ec5-ea14-439e-bfde-a8c7566d254c
2018-10-23 13:15:25,590 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Trying to fetch storage pool 3e810986-e702-36ea-a87b-fd48064ecb12 from libvirt
2018-10-23 13:15:25,592 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) (logid:581a1d95) Asking libvirt to refresh storage pool 3e810986-e702-36ea-a87b-fd48064ecb12
2018-10-23 13:21:28,804 WARN [kvm.resource.KVMHAChecker] (Script-3:null) (logid:) Interrupting script.
2018-10-23 13:21:28,806 WARN [kvm.resource.KVMHAChecker] (pool-15160-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:32,826 WARN [kvm.resource.KVMHAChecker] (Script-7:null) (logid:) Interrupting script.
2018-10-23 13:21:32,827 WARN [kvm.resource.KVMHAChecker] (pool-15161-thread-1:null) (logid:c3d5dcaf) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:21:36,846 WARN [kvm.resource.KVMHAChecker] (Script-4:null) (logid:) Interrupting script.
2018-10-23 13:21:36,847 WARN [kvm.resource.KVMHAChecker] (pool-15162-thread-1:null) (logid:4a3cb34f) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.73.96.232 -p /vol/t500_0_fls3_pool36_root -m /mnt/d05f1c9d-9454-3707-a6c4-781398af198d -h 10.73.96.212 -r -t 60 . Output is:
2018-10-23 13:24:44,205 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Lost connection to host: 10.73.96.19. Attempting reconnection while we still have 5 commands in progress.
2018-10-23 13:24:44,206 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) NioClient connection closed
2018-10-23 13:24:44,206 INFO [cloud.agent.Agent] (Agent-Handler-1:null) (logid:5a5a7500) Reconnecting to host:10.73.96.19
2018-10-23 13:24:44,207 INFO [utils.nio.NioClient] (Agent-Handler-1:null) (logid:5a5a7500) Connecting to 10.73.96.19:8250
2018-10-23 13:24:44,207 INFO [utils.nio.Link] (Agent-Handler-1:null) (logid:5a5a7500) Conf file found: /etc/cloudstack/agent/agent.properties

Note that sometimes the agent successfully goes into the Disconnect state, but the host HA framework might still fire after the kvm.ha.degraded.max.period timer expires, which is not expected. In any case, we want to avoid massive KVM host resets via IPMI for storage-related problems, because that is more damaging than waiting for primary storage to come back.
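
To illustrate the kind of bounded check the "Timed out" entries above point at, here is a minimal Python sketch of running an external health check with a hard deadline, so that the calling thread returns a status instead of waiting on unavailable storage. This is not CloudStack's code and not what kvmheartbeat.sh does; the function name `run_check`, the status strings, and the timeout values are hypothetical. It also assumes the caller is willing to abandon a child process that does not exit after being killed, which can happen when a process is stuck in uninterruptible I/O on a hard NFS mount.

```python
import subprocess
import sys


def run_check(cmd, timeout_s):
    """Run cmd and return "ok", "failed", or "timed_out".

    Hypothetical helper: the caller is never blocked for much longer
    than timeout_s, even if the child process hangs.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process in uninterruptible I/O wait may not
        # exit until the storage returns, so we do not wait for it here.
        proc.kill()
        return "timed_out"
    return "ok" if rc == 0 else "failed"


if __name__ == "__main__":
    # Stand-in commands; a real check would probe the storage mount.
    print(run_check([sys.executable, "-c", "pass"], 5))
    print(run_check([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

Returning a distinct "timed_out" status, rather than treating a timeout as a failed heartbeat, would let a caller tell "storage is slow or unreachable" apart from "host is unhealthy", which is the distinction the comment above asks for before any IPMI reset is triggered.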