Thomas Heil created CLOUDSTACK-10393:
----------------------------------------

Summary: VM does not restart after Host Power Failure under KVM Centos7
Key: CLOUDSTACK-10393
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10393
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Components: cloudstack-agent, eventbus, KVM
Affects Versions: 4.11.1.0
Environment:
- CentOS 7 management server with 6 hosts, all in one advanced zone with one cluster
- All hosts have out-of-band management configured via IPMI, and it is working
Reporter: Thomas Heil
Fix For: 4.12, 4.11


HA VMs are not restarted after a power failure on a host. The system VMs are not restarted either (I have changed their offering to HA). The status of the HA VMs does not change, even though the hypervisor is dead.

When I crash one host in the cluster that contains several VMs, in particular the system VMs, nothing happens. They simply do not get restarted.

Here is the first snippet from the log
--
018-08-15 19:47:15,019 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-56350724) (logid:83436643) Investigating why host Host s-217-VM (id:37) has disconnected with event AgentDisconnected
2018-08-15 19:47:15,020 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-56350724) (logid:83436643) Status for Host s-217-VM (id:37) was Connecting. Investigation determined the current state is Alert
--

So CloudStack recognizes that the host is down.

The next snippet from the log
--
2018-08-15 20:02:35,169 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) SimpleInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:02:35,169 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) XenServerInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:02:35,172 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) KVMInvestigator found VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:02:35,172 DEBUG [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) VM s-217-VM is found to be alive by KVMInvestigator
2018-08-15 20:02:35,172 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) Rescheduling work HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:03:35 CEST 2018
2018-08-15 20:04:35,175 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Processing work HAWork[34-HA-217-Running-Investigating]
2018-08-15 20:04:35,180 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Unable to reach the agent for VM[SecondaryStorageVm|s-217-VM]: Resource [Host:31] is unreachable: Host 31: Host with specified id is not in the right state: Down
2018-08-15 20:04:35,180 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) SimpleInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,180 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) XenServerInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,183 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) KVMInvestigator found VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:04:35,183 DEBUG [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) VM s-217-VM is found to be alive by KVMInvestigator
2018-08-15 20:04:35,183 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Rescheduling work HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:05:35 CEST 2018
2018-08-15 20:06:35,184 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Processing work HAWork[34-HA-217-Running-Investigating]
2018-08-15 20:06:35,188 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Unable to reach the agent for VM[SecondaryStorageVm|s-217-VM]: Resource [Host:31] is unreachable: Host 31: Host with specified id is not in the right state: Down
2018-08-15 20:06:35,189 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) SimpleInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:06:35,189 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) XenServerInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:06:35,191 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) KVMInvestigator found VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:06:35,191 DEBUG [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) VM s-217-VM is found to be alive by KVMInvestigator
2018-08-15 20:06:35,191 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Rescheduling work HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:07:36 CEST 2018
2018-08-15 20:06:35,191 WARN [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Giving up, retried max. times for work: HAWork[34-HA-217-Running-Investigating]
2018-08-15 21:40:03,884 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) Processing work HAWork[35-HA-217-Running-Investigating]
2018-08-15 21:40:03,903 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) Agent responded with state PowerOff
2018-08-15 21:40:03,903 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) SimpleInvestigator found VM[SecondaryStorageVm|s-217-VM] to be alive? false
--

Here the problem seems to be that the "Investigator" claims the VM is alive even though the host is down. All hosts have IPMI configured, so I would expect the CloudStack management server to shoot the node in the head via an IPMI power-down, to make sure this is not just a connection problem. Then the state of all VMs on that host should be updated to "stopped", and a new eligible host should be chosen to start the VMs again.

This issue seems to recur from time to time, so I would be really interested in how to reproduce it with the Simulator so that we could write a test case. I am not sure whether the IPMI part can be emulated with the Simulator.

This bug could be similar or related to CLOUDSTACK-10246 and CLOUDSTACK-8713. I have tested CLOUDSTACK-10246 and it does not solve the issue here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
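
The recovery sequence the report expects (fence the host over IPMI, mark its VMs as stopped, restart the HA VMs on an eligible host) can be sketched as below. This is only an illustration of the expected ordering, not CloudStack code: `Host`, `Vm`, `running_on`, `recover_failed_host`, the `power_off` callback, the state strings, and the "least loaded host" placement rule are all hypothetical names and assumptions that do not appear in the report or in CloudStack.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Host:
    host_id: int
    state: str = "Up"  # hypothetical states: "Up", "Alert", "Down"


@dataclass
class Vm:
    name: str
    host_id: int
    state: str = "Running"  # hypothetical states: "Running", "Stopped"
    ha_enabled: bool = True


def running_on(host: Host, vms: List[Vm]) -> int:
    """Count the VMs currently recorded as running on the given host."""
    return sum(1 for vm in vms
               if vm.host_id == host.host_id and vm.state == "Running")


def recover_failed_host(failed: Host, hosts: List[Host], vms: List[Vm],
                        power_off: Callable[[Host], bool]) -> List[str]:
    """Fence the failed host, then restart its HA VMs elsewhere.

    power_off stands in for an out-of-band (IPMI) power-down and must
    return True only when the host is confirmed to be powered off.
    Returns the names of the VMs that were restarted.
    """
    # Step 1: fence. Without a confirmed power-off the VMs might still be
    # running, so restarting them elsewhere would risk running them twice.
    if not power_off(failed):
        return []
    failed.state = "Down"

    # Step 2: every VM recorded on the fenced host is now known to be stopped.
    stranded = [vm for vm in vms if vm.host_id == failed.host_id]
    for vm in stranded:
        vm.state = "Stopped"

    # Step 3: restart HA-enabled VMs on the least loaded host that is Up
    # (an assumed placement rule, chosen only to keep the sketch short).
    candidates = [h for h in hosts
                  if h.state == "Up" and h.host_id != failed.host_id]
    restarted = []
    for vm in stranded:
        if not vm.ha_enabled or not candidates:
            continue
        target = min(candidates, key=lambda h: running_on(h, vms))
        vm.host_id, vm.state = target.host_id, "Running"
        restarted.append(vm.name)
    return restarted
```

The point of the ordering is that no VM is restarted until the power-off is confirmed. In the log above the opposite happens: the VM is still reported as alive while Host 31 is already in state Down, so the HA work is rescheduled and eventually given up.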