Thomas Heil created CLOUDSTACK-10393:
----------------------------------------

             Summary: VM does not restart after Host Power Failure under KVM 
CentOS 7
                 Key: CLOUDSTACK-10393
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10393
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: cloudstack-agent, eventbus, KVM
    Affects Versions: 4.11.1.0
         Environment:  - CentOS 7 Management Server with 6 hosts, all in one 
advanced zone with one cluster
 - All hosts have out-of-band management configured via IPMI, and it is working

            Reporter: Thomas Heil
             Fix For: 4.12, 4.11


 

HA VMs are not restarted after a power failure on a host. The system VMs are 
not restarted either (I have changed their offering to HA).

The status of the HA VMs does not change, even though the hypervisor is dead.

 

When I crash one host in the cluster that contains several VMs, in particular 
the system VMs, nothing happens. They just do not get restarted.

Here is the first snippet from the log:

--
2018-08-15 19:47:15,019 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-56350724) (logid:83436643) Investigating why host Host 
s-217-VM (id:37) has disconnected with event AgentDisconnected
2018-08-15 19:47:15,020 INFO  [c.c.a.m.AgentManagerImpl] 
(AgentTaskPool-10:ctx-56350724) (logid:83436643) Status for Host s-217-VM 
(id:37) was Connecting.  Investigation determined the current state is Alert
--

So CloudStack recognizes that the host is down.

 

The next snippet from the log:

--

2018-08-15 20:02:35,169 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) SimpleInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:02:35,169 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) XenServerInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:02:35,172 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) KVMInvestigator found 
VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:02:35,172 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) VM s-217-VM is found to be 
alive by KVMInvestigator
2018-08-15 20:02:35,172 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-81d4913b work-34) (logid:d1e40a95) Rescheduling work 
HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:03:35 
CEST 2018
2018-08-15 20:04:35,175 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Processing work 
HAWork[34-HA-217-Running-Investigating]
2018-08-15 20:04:35,180 DEBUG [c.c.h.CheckOnAgentInvestigator] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Unable to reach the agent 
for VM[SecondaryStorageVm|s-217-VM]: Resource [Host:31] is unreachable: Host 
31: Host with specified id is not in the right state: Down
2018-08-15 20:04:35,180 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) SimpleInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,180 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) XenServerInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,183 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) KVMInvestigator found 
VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:04:35,183 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) VM s-217-VM is found to be 
alive by KVMInvestigator
2018-08-15 20:04:35,183 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Rescheduling work 
HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:05:35 
CEST 2018
2018-08-15 20:06:35,184 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Processing work 
HAWork[34-HA-217-Running-Investigating]
2018-08-15 20:06:35,188 DEBUG [c.c.h.CheckOnAgentInvestigator] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Unable to reach the agent 
for VM[SecondaryStorageVm|s-217-VM]: Resource [Host:31] is unreachable: Host 
31: Host with specified id is not in the right state: Down
2018-08-15 20:06:35,189 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) SimpleInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:06:35,189 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) XenServerInvestigator could 
not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:06:35,191 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) KVMInvestigator found 
VM[SecondaryStorageVm|s-217-VM] to be alive? true
2018-08-15 20:06:35,191 DEBUG [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) VM s-217-VM is found to be 
alive by KVMInvestigator
2018-08-15 20:06:35,191 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Rescheduling work 
HAWork[34-HA-217-Running-Investigating] to try again at Wed Aug 15 20:07:36 
CEST 2018
2018-08-15 20:06:35,191 WARN  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-0:ctx-053d5a41 work-34) (logid:872583f0) Giving up, retried max. 
times for work: HAWork[34-HA-217-Running-Investigating]
2018-08-15 21:40:03,884 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) Processing work 
HAWork[35-HA-217-Running-Investigating]
2018-08-15 21:40:03,903 DEBUG [c.c.h.CheckOnAgentInvestigator] 
(HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) Agent responded with state 
PowerOff
2018-08-15 21:40:03,903 INFO  [c.c.h.HighAvailabilityManagerImpl] 
(HA-Worker-3:ctx-bc7feed5 work-35) (logid:f03140ba) SimpleInvestigator found 
VM[SecondaryStorageVm|s-217-VM] to be alive? false

--

The problem here seems to be that the investigator (KVMInvestigator) reports 
the VM as alive even though the host is down.
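
To make that contradiction easy to spot in a longer log, here is a minimal 
sketch in Python. It is not part of CloudStack. It assumes that every log entry 
is on a single line (the e-mail wrapping above is undone) and that entries 
sharing a logid belong to the same investigation pass; both are assumptions on 
my side. The function name find_contradictions is made up for this example.

```python
import re

# Sample entries copied from the log above, with the e-mail line wrapping undone.
SAMPLE_LOG = """\
2018-08-15 20:04:35,180 DEBUG [c.c.h.CheckOnAgentInvestigator] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) Unable to reach the agent for VM[SecondaryStorageVm|s-217-VM]: Resource [Host:31] is unreachable: Host 31: Host with specified id is not in the right state: Down
2018-08-15 20:04:35,180 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) SimpleInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,180 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) XenServerInvestigator could not find VM[SecondaryStorageVm|s-217-VM]
2018-08-15 20:04:35,183 INFO  [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-2f21e97a work-34) (logid:ae0b3b16) KVMInvestigator found VM[SecondaryStorageVm|s-217-VM] to be alive? true
"""

LOGID = re.compile(r"\(logid:(\w+)\)")
HOST_DOWN = re.compile(
    r"Host (\d+): Host with specified id is not in the right state: Down"
)
VERDICT = re.compile(
    r"(\w+Investigator) found (VM\[[^\]]+\]) to be alive\? (true|false)"
)


def find_contradictions(log_text):
    """Return (logid, host_id, investigator, vm) tuples for every pass in
    which a host was reported Down and a VM was then reported alive."""
    down_hosts = {}  # logid -> id of the host reported Down in that pass
    hits = []
    for line in log_text.splitlines():
        logid_match = LOGID.search(line)
        if not logid_match:
            continue
        logid = logid_match.group(1)

        down = HOST_DOWN.search(line)
        if down:
            down_hosts[logid] = down.group(1)
            continue

        verdict = VERDICT.search(line)
        if verdict and verdict.group(3) == "true" and logid in down_hosts:
            hits.append(
                (logid, down_hosts[logid], verdict.group(1), verdict.group(2))
            )
    return hits


if __name__ == "__main__":
    for hit in find_contradictions(SAMPLE_LOG):
        print(hit)
```

Run against the management server log, this should list each pass where an 
investigator answered "alive" right after the host was seen as Down.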

All hosts have IPMI configured, so I would expect the CloudStack management 
server to shoot the node in the head via an IPMI power-down, to ensure that 
this is not just a connection problem.

Then the state of all VMs on that host should be updated to "Stopped", and a 
new eligible host should be chosen to start the VMs again.
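
The flow I would expect can be sketched roughly as follows. This is only an 
illustration in Python, not CloudStack code: the classes Vm and Host, the 
functions fence and recover, and the host names kvm1 and kvm2 are all made up, 
and fence only stands in for a real IPMI power-off. Picking the first host that 
is up is also just a placeholder for whatever the real planner does.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Vm:
    name: str
    ha_enabled: bool
    state: str = "Running"
    host: Optional[str] = None


@dataclass
class Host:
    name: str
    state: str = "Up"        # "Up" or "Down"
    powered_on: bool = True
    vms: list = field(default_factory=list)


def fence(host):
    """Stand-in for an out-of-band (IPMI) power-off of the host.

    A real implementation would talk to the BMC and return False on failure.
    """
    host.powered_on = False
    return True


def recover(failed, candidates):
    """Fence the failed host, mark its VMs Stopped and restart the HA VMs
    on another host. Returns a dict mapping VM name to new host name."""
    if not fence(failed):
        # Without confirmed fencing the VMs might still be running.
        return {}
    failed.state = "Down"

    placements = {}
    for vm in list(failed.vms):
        vm.state = "Stopped"
        vm.host = None
        failed.vms.remove(vm)
        if not vm.ha_enabled:
            continue
        # Placeholder policy: first host that is up and is not the failed one.
        target = next(
            (h for h in candidates
             if h is not failed and h.state == "Up" and h.powered_on),
            None,
        )
        if target is None:
            continue  # no eligible host, the VM stays Stopped
        target.vms.append(vm)
        vm.host = target.name
        vm.state = "Running"
        placements[vm.name] = target.name
    return placements


if __name__ == "__main__":
    kvm1, kvm2 = Host("kvm1"), Host("kvm2")
    kvm1.vms = [Vm("s-217-VM", True, host="kvm1"), Vm("test-vm", False, host="kvm1")]
    print(recover(kvm1, [kvm1, kvm2]))
```

The important point is the order: fence first, then mark the VMs as stopped, 
and only then start them somewhere else.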

It seems to me that this issue recurs from time to time, so I would be really 
interested in how to reproduce it with the simulator so that we could just 
write a test case. I am not sure whether the IPMI part can be emulated with 
the simulator.

 

This bug could be similar or related to CLOUDSTACK-10246 and CLOUDSTACK-8713.

I have tested CLOUDSTACK-10246 and it does not solve the issue here.

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
