[ https://issues.apache.org/jira/browse/CLOUDSTACK-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Abhinandan Prateek updated CLOUDSTACK-4740: ------------------------------------------- Assignee: Sateesh Chodapuneedi > Some vSphere VMs are shutdown when ACS is restarted > --------------------------------------------------- > > Key: CLOUDSTACK-4740 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4740 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Management Server > Affects Versions: 4.1.0, 4.1.1, 4.2.0, Future > Environment: I'm running ACS 4.1.1 with vSphere 5.1 > Reporter: ilya musayev > Assignee: Sateesh Chodapuneedi > Priority: Critical > Labels: management, poweroff > Fix For: 4.3.0 > > > If management server is restarted, when management server starts - it checks > whether the agentState for vSphere VMs and if it does not get a proper > response, it marks them as stopped. > As the result, some of my virtual instances would shutdown. > Attempting to analyze this issue further, here are my findings and errors > seen in the log. > 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] > (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so > adding it to the changes: i-2-262-acs-docs-fc17 > 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request] > (AgentTaskPool-1:null) Seq -1--1: Startup request from directly connected > host: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 11, > [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"poolSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-CLOUD41-OLD":{"state":"Stopped"},"v-260-CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1-CLOUD41-OLD":{"state":"Running"},"i-27-280-CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},"i-24-265-test3":{"state":"Running"},"cloud01l-ops-08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-CLOUD411":{"state":"Running"},"i-2-281-acs-appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"DcGlobal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing","dataCenter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19...@vc00q-ops-08.portal.webmd.com","name":"vmha62d-ops-08.portal.webmd.com","version":"4.1.1-SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68:b5:99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:99:73:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","host":"vmha62d-ops-08.portal.webmd.com","localPath":"","hostPath":"datastore-19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":140383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-4569-ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-adc5af770bf6","wait":0}] > } > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and > realState = Stopped > 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] > (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and > realState = Stopped > 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl] > (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17 > 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request] > (AgentTaskPool-1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: > 345049078181, via: 11, Ver: v1, Flags: 100101, > [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] > } > 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request] > (AgentTaskPool-1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: > 345049078181, via: 11, Ver: v1, Flags: 100101, > [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-demo01t-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-283-vmbld01l-ops-08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-285-ossec01l-ops-08","wait":0}}] > } > 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource] > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource > StopCommand: {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0} > 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 > on host > 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource] > (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-docs-fc17 is > already in stopped state > 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request] (DirectAgent-3:null) > Seq 11-1418264581: Processing: { Ans: , MgmtId: 345049078181, via: 11, Ver: > v1, Flags: 100, [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM > i-2-262-acs-docs-fc17 is already in stopped > state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop VM > i-2-278-demo01t-ops-08 > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop > VM i-2-281-acs-appliance > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop > VM i-2-283-vmbld01l-ops-08 > Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"Stop > VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] } > 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] > (DirectAgent-20:vmha62d-ops-08.portal.webmd.com) find VM > i-2-262-acs-docs-fc17 on host > If for whatever reason agentState is stopped for vSphere, we should not force > power off the VM, neither do we have to restart it. > The code that causes the restart is in > engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java, > lines 2365 - 2376 > In this snippet we do the check if cs state == Running and agentState == > Stopped > We should not power off the VM if we cannot retrieve the state! > 2233 protected Command compareState(long hostId, VMInstanceVO vm, final > AgentVmInfo info, final boolean fullSync, boolean trackExternalChange) { > 2234 State agentState = info.state; > 2235 final State serverState = vm.getState(); > 2236 final String serverName = vm.getInstanceName(); > 2237 > 2238 Command command = null; > 2239 s_logger.debug("VM " + serverName + ": cs state = " + > serverState + " and realState = " + agentState); > 2240 if (s_logger.isDebugEnabled()) { > 2241 s_logger.debug("VM " + serverName + ": cs state = " + > serverState + " and realState = " + agentState); > 2242 } > 2243 > 2244 if (agentState == State.Error) { > 2245 agentState = State.Stopped; > 2246 > 2247 short alertType = AlertManager.ALERT_TYPE_USERVM; > 2248 if (VirtualMachine.Type.DomainRouter.equals(vm.getType())) { > 2249 alertType = AlertManager.ALERT_TYPE_DOMAIN_ROUTER; > 2250 } else if > (VirtualMachine.Type.ConsoleProxy.equals(vm.getType())) { > 2251 alertType = AlertManager.ALERT_TYPE_CONSOLE_PROXY; > 2252 } else if > (VirtualMachine.Type.SecondaryStorageVm.equals(vm.getType())) { > 2253 alertType = AlertManager.ALERT_TYPE_SSVM; > 2254 } > 2255 > 2256 HostPodVO podVO = _podDao.findById(vm.getPodIdToDeployIn()); > 2257 DataCenterVO dcVO = _dcDao.findById(vm.getDataCenterId()); > 2258 HostVO hostVO = _hostDao.findById(vm.getHostId()); > 2259 > 2260 String hostDesc = "name: " + hostVO.getName() + " (id:" + > hostVO.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + > podVO.getName(); > 2261 _alertMgr.sendAlert(alertType, vm.getDataCenterId(), > vm.getPodIdToDeployIn(), "VM (name: " + vm.getInstanceName() + ", id: " + > vm.getId() + ") stopped on host " + > 2262 > hostDesc + " due to storage failure", > 2263 "Virtual Machine " + vm.getInstanceName() + " (id: " + > vm.getId() + ") running on host [" + vm.getHostId() + "] stopped due to > storage failure."); > 2264 } > 2265 > 2266 if (trackExternalChange) { > 2267 if (serverState == State.Starting) { > 2268 if (vm.getHostId() != null && vm.getHostId() != hostId) { > 2269 s_logger.info("CloudStack is starting VM on host " + > vm.getHostId() + ", but status report comes from a different host " + hostId + > 2270 ", skip status sync for vm: " + > vm.getInstanceName()); > 2271 return null; > 2272 } > 2273 } > 2274 if (vm.getHostId() == null || hostId != vm.getHostId()) { > 2275 try { > 2276 ItWorkVO workItem = > _workDao.findByOutstandingWork(vm.getId(), State.Migrating); > 2277 if (workItem == null) { > 2278 stateTransitTo(vm, > VirtualMachine.Event.AgentReportMigrated, hostId); > 2279 } > 2280 } catch (NoTransitionException e) { > 2281 } > 2282 } > 2283 } > 2284 > 2285 // during VM migration time, don't sync state will agent status > update > 2286 if (serverState == State.Migrating) { > 2287 s_logger.debug("Skipping vm in migrating state: " + vm); > 2288 return null; > 2289 } > 2290 > 2291 if (trackExternalChange) { > 2292 if (serverState == State.Starting) { > 2293 if (vm.getHostId() != null && vm.getHostId() != hostId) { > 2294 s_logger.info("CloudStack is starting VM on host " + > vm.getHostId() + ", but status report comes from a different host " + hostId + > 2295 ", skip status sync for vm: " + > vm.getInstanceName()); > 2296 return null; > 2297 } > 2298 } > 2299 > 2300 if (serverState == State.Running) { > 2301 try { > 2302 // > 2303 // we had a bug that sometimes VM may be at Running > State > 2304 // but host_id is null, we will cover it here. > 2305 // means that when CloudStack DB lost of host > information, > 2306 // we will heal it with the info reported from host > 2307 // > 2308 if (vm.getHostId() == null || hostId != > vm.getHostId()) { > 2309 if (s_logger.isDebugEnabled()) { > 2310 s_logger.debug("detected host change when VM > " + vm + " is at running state, VM could be live-migrated externally from > host " + vm.getHostId() + > 2311 " to host " + hostId); > 2312 } > 2313 > 2314 stateTransitTo(vm, > VirtualMachine.Event.AgentReportMigrated, hostId); > 2315 } > 2316 } catch (NoTransitionException e) { > 2317 s_logger.warn(e.getMessage()); > 2318 } > 2319 } > 2320 } > 2321 > 2322 if (agentState == serverState) { > 2323 if (s_logger.isDebugEnabled()) { > 2324 s_logger.debug("Both states are " + agentState + " for " > + vm); > 2325 } > 2326 assert (agentState == State.Stopped || agentState == > State.Running) : "If the states we send up is changed, this must be changed."; > 2327 if (agentState == State.Running) { > 2328 try { > 2329 stateTransitTo(vm, > VirtualMachine.Event.AgentReportRunning, hostId); > 2330 } catch (NoTransitionException e) { > 2331 s_logger.warn(e.getMessage()); > 2332 } > 2333 // FIXME: What if someone comes in and sets it to > stopping? Then > 2334 // what? > 2335 return null; > 2336 } > 2337 > 2338 s_logger.debug("State matches but the agent said stopped so > let's send a cleanup command anyways."); > 2339 return cleanup(vm); > 2340 } > 2341 > 2342 if (agentState == State.Shutdowned) { > 2343 if (serverState == State.Running || serverState == > State.Starting || serverState == State.Stopping) { > 2344 try { > 2345 advanceStop(vm, true); > 2346 } catch (AgentUnavailableException e) { > 2347 assert (false) : "How do we hit this with forced > on?"; > 2348 return null; > 2349 } catch (OperationTimedoutException e) { > 2350 assert (false) : "How do we hit this with forced > on?"; > 2351 return null; > 2352 } catch (ConcurrentOperationException e) { > 2353 assert (false) : "How do we hit this with forced > on?"; > 2354 return null; > 2355 } > 2356 } else { > 2357 s_logger.debug("Sending cleanup to a shutdowned vm: " + > vm.getInstanceName()); > 2358 command = cleanup(vm); > 2359 } > 2360 } else if (agentState == State.Stopped) { > 2361 // This state means the VM on the agent was detected > previously > 2362 // and now is gone. This is slightly different than if the VM > 2363 // was never completed but we still send down a Stop Command > 2364 // to ensure there's cleanup. > 2365 if (serverState == State.Running) { > 2366 // Our records showed that it should be running so let's > restart > 2367 // it. > 2368 _haMgr.scheduleRestart(vm, false); > 2369 } else if (serverState == State.Stopping) { > 2370 _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop); > 2371 s_logger.debug("Scheduling a check stop for VM in > stopping mode: " + vm); > 2372 } else if (serverState == State.Starting) { > 2373 s_logger.debug("Ignoring VM in starting mode: " + > vm.getInstanceName()); > 2374 _haMgr.scheduleRestart(vm, false); > 2375 } > 2376 command = cleanup(vm); > I'd be happy to test the fix if such is comes out. > Thanks > ilya -- This message was sent by Atlassian JIRA (v6.1#6144)