rhtyd commented on issue #2890: KVMHAMonitor thread blocks indefinitely while NFS not available URL: https://github.com/apache/cloudstack/issues/2890#issuecomment-434207830 @csquire I could not reproduce the issue. I guess it depends on how you're simulating NFS/shared-storage failure and what your iptables rules are. In my case I simply shutdown the NFS server instead of iptables rules and observed following: ``` . 2018-10-30 08:04:04,842 WARN [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -h 172.20.1.10 . Output is: 2018-10-30 08:04:04,843 WARN [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout, try: 5 of 5 2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) [ignored] interupted between heartbeat retries. 2018-10-30 08:04:04,843 WARN [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) write heartbeat failed: timeout; stopping cloudstack-agent 2018-10-30 08:04:04,843 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 172.20.0.1 -p /export/testing/primary -m /mnt/05c30f6d-5725-3369-a8d6-b5fb37c7ba8f -c 2018-10-30 08:04:04,845 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Executing while with timeout : 60000 2018-10-30 08:04:05,007 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms 2018-10-30 08:04:05,009 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Executing while with timeout : 1800000 2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) (logid:) Execution is successful. 2018-10-30 08:04:05,109 DEBUG [kvm.resource.LibvirtConnection] (UgentTask-5:null) (logid:) Looking for libvirtd connection at: qemu:///system 2018-10-30 08:04:05,114 DEBUG [cloud.agent.Agent] (UgentTask-5:null) (logid:) Sending ping: Seq 1-157: { Cmd , MgmtId: -1, via: 1, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{"r-5-VM":{"state":"PowerOn","host":"centos7-kvm1"},"s-1-VM":{"state":"PowerOn","host":"centos7-kvm1"},"v-2-VM":{"state":"PowerOn","host":"centos7-kvm1"},"i-2-4-VM":{"state":"PowerOn","host":"centos7-kvm1"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":1,"wait":0}}] } 2018-10-30 08:04:05,166 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) (logid:) Received response: Seq 1-157: { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":1,"wait":0},"result":true,"wait":0}}] } 2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Request:Seq 1-3303108851699548248: { Cmd , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.GetVncPortCommand":{"id":4,"name":"i-2-4-VM","wait":0}}] } 2018-10-30 08:04:06,762 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Processing command: com.cloud.agent.api.GetVncPortCommand 2018-10-30 08:04:06,762 DEBUG [kvm.resource.LibvirtConnection] (agentRequest-Handler-5:null) (logid:0aa26d2d) Looking for libvirtd connection at: qemu:///system 2018-10-30 08:04:06,765 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:0aa26d2d) Seq 1-3303108851699548248: { Ans: , MgmtId: 2485476061376, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.GetVncPortAnswer":{"address":"172.20.1.10","port":5903,"result":true,"wait":0}}] } 2018-10-30 08:04:09,871 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Exit value is 143 2018-10-30 08:04:09,872 DEBUG [kvm.resource.KVMHAMonitor] (Thread-34:null) (logid:) Redirecting to /bin/systemctl stop cloudstack-agent.service 2018-10-30 08:04:09,873 INFO [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Stopping the agent: Reason = sig.kill 2018-10-30 08:04:09,874 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Sending shutdown to management server 2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Location 1: Socket Socket[addr=/172.20.0.1,port=8250,localport=43646] closed on read. Probably -1 returned: Connection closed with -1 on reading size. 2018-10-30 08:04:09,876 DEBUG [utils.nio.NioConnection] (Agent-NioConnectionHandler-1:null) (logid:) Closing socket Socket[addr=/172.20.0.1,port=8250,localport=43646] 2018-10-30 08:04:10,875 DEBUG [kvm.resource.LibvirtConnection] (AgentShutdownThread:null) (logid:) Looking for libvirtd connection at: qemu:///system ``` I could see the `kvmheartbeat.sh` try 5 times every 60 seconds and eventually failing to shutdown the agent by this: `/bin/systemctl stop cloudstack-agent.service`. On the management server side, I saw the host state go from Up to Disconnected: ![screenshot from 2018-10-30 13-38-20](https://user-images.githubusercontent.com/95203/47704265-223dd100-dc49-11e8-9f71-1aa36e66e47d.png) The only change in behaviour I see is that the KVM host is not rebooted but only the agent gets shutdown.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services