[ https://issues.apache.org/jira/browse/CLOUDSTACK-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952272#comment-14952272 ]
ASF GitHub Bot commented on CLOUDSTACK-8848: -------------------------------------------- GitHub user resmo opened a pull request: https://github.com/apache/cloudstack/pull/922 [4.5] CLOUDSTACK-8848: ensure power state is up to date for missing PowerState handling 1:1 cherry-picks, see #909 You can merge this pull request into a Git repository by running: $ git pull https://github.com/resmo/cloudstack fix/4.5-8848 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/cloudstack/pull/922.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #922 ---- commit 30c704915c882f8ac2ab551aa157d0b7a0bdb47f Author: Rene Moser <re...@apache.org> Date: 2015-09-24T19:10:26Z CLOUDSTACK-8848: ensure power state is up to date when handling missing VMs in powerReport There 2 things which has been changed. * We look on power_state_update_time instead of update_time. Didn't make sense to me at all to look at update_time. * Due DB update optimisation, powerState will only be updated if < MAX_CONSECUTIVE_SAME_STATE_UPDATE_COUNT. That is why we can not rely on these information unless we make sure these are up to date. commit 968991e0965d2279ba5f5c9e6496dc1a39dbccac Author: Daan Hoogland <d...@onecht.net> Date: 2015-10-02T09:42:42Z CLOUDSTACK-8848: added null pointer guard to new public method ---- > Ensure power state is up to date when handling missing VMs in pwerReport > ------------------------------------------------------------------------ > > Key: CLOUDSTACK-8848 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8848 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: VMware > Affects Versions: 4.5.2, 4.6.0 > Reporter: René Moser > Assignee: Daan Hoogland > Priority: Blocker > Fix For: 4.5.3, 4.6.0 > > > In some conditions (race condition), VR gets rebooted after a out of band > migration was done on vCenter. > {panel:bgColor=#FFFFCE} > Note, new global setting in 4.5.2 "VR reboot after out of band migration" is > set to *false* and this looks more like a bug. > {panel} > After a VR migration to a host _and_ when the VM power state report gathering > is running, the VR (and also any user VM as well) will get into the > "PowerReportMissing". > If the VM is a VR. it will be powered off and started again on vCenter. That > is what we see. In can not be reproduced every time a migration was done, but > it seems the problem is related to "powerReportMissing". > I grep-ed the source and found this line related > https://github.com/apache/cloudstack/blob/4.5.2/engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java#L3616 > and also it seems that the graceful period might be also related, > https://github.com/apache/cloudstack/blob/4.5.2/engine/orchestration/src/com/cloud/vm/VirtualMachinePowerStateSyncImpl.java#L110 > In case it is a user VM, we see in the logs that the state will be set to > powered-off, but the VM keeps running. After a while a new VM power state > report is running and the state for the user VM gets updated to Running > again. Below the logs > h2. VR r-342-VM reboot log > {code:none} > 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-253:ctx-c4f59216) Run missing VM report. current time: > 1442302626508 > 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-253:ctx-c4f59216) Detected missing VM. host: 19, vm id: > 342, power state: PowerReportMissing, last state update: 1442302506000 > 2015-09-15 09:37:06,508 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-253:ctx-c4f59216) vm id: 342 - time since last state > update(120508ms) has passed graceful period > 2015-09-15 09:37:06,517 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-253:ctx-c4f59216) VM state report is updated. host: 19, > vm id: 342, power state: PowerReportMissing > 2015-09-15 09:37:06,525 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-253:ctx-c4f59216) VM r-342-VM is at Running and we > received a power-off report while there is no pending jobs on it > 2015-09-15 09:37:06,532 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-253:ctx-c4f59216) Seq 19-4511199451741686482: Sending { > Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int), > Ver: v1, Flags: 100011, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0}}] > } > 2015-09-15 09:37:06,532 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-253:ctx-c4f59216) Seq 19-4511199451741686482: Executing: > { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int), > Ver: v1, Flags: 100011, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0}}] > } > 2015-09-15 09:37:06,532 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgent-136:ctx-9bc0a401) Seq 19-4511199451741686482: Executing request > 2015-09-15 09:37:06,532 INFO [c.c.h.v.r.VmwareResource] > (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) Executing resource StopCommand: > {"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"r-342-VM","wait":0} > 2015-09-15 09:37:06,551 DEBUG [c.c.h.v.m.HostMO] > (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) find VM r-342-VM on host > 2015-09-15 09:37:06,551 INFO [c.c.h.v.m.HostMO] > (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) VM r-342-VM not found in host cache > 2015-09-15 09:37:06,551 DEBUG [c.c.h.v.m.HostMO] > (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) load VM cache on host > 2015-09-15 09:37:06,577 INFO [c.c.h.v.r.VmwareResource] > (DirectAgent-136:ctx-9bc0a401 cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) VM r-342-VM is no longer in vSphere > 2015-09-15 09:37:06,578 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgent-136:ctx-9bc0a401) Seq 19-4511199451741686482: Response Received: > 2015-09-15 09:37:06,578 DEBUG [c.c.a.t.Request] > (DirectAgent-136:ctx-9bc0a401) Seq 19-4511199451741686482: Processing: { > Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, > [{"com.cloud.agent.api.StopAnswer":{"result":true,"details":"VM r-342-VM is > no longer in vSphere","wait":0}}] } > 2015-09-15 09:37:06,578 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-253:ctx-c4f59216) Seq 19-4511199451741686482: Received: > { Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, { StopAnswer } } > 2015-09-15 09:37:06,585 DEBUG [c.c.n.r.VirtualNetworkApplianceManagerImpl] > (DirectAgentCronJob-253:ctx-c4f59216) Successfully updated user statistics as > a part of domR VM[DomainRouter|r-342-VM] reboot/stop > 2015-09-15 09:37:06,617 DEBUG [c.c.c.CapacityManagerImpl] > (DirectAgentCronJob-253:ctx-c4f59216) VM state transitted from :Running to > Stopped with event: FollowAgentPowerOffReportvm's original host id: 20 new > host id: null host id before state transition: 19 > 2015-09-15 09:37:06,627 WARN [o.a.c.alerts] > (DirectAgentCronJob-253:ctx-c4f59216) alertType:: 27 // dataCenterId:: 1 // > podId:: 1 // clusterId:: null // message:: VM state sync alert > 2015-09-15 09:37:06,630 DEBUG [c.c.a.AlertManagerImpl] > (DirectAgentCronJob-253:ctx-c4f59216) Have already sent: 1 emails for alert > type '27' -- skipping send email > 2015-09-15 09:37:06,630 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-253:ctx-c4f59216) VM r-342-VM is sync-ed to at Stopped > state according to power-off report from hypervisor > 2015-09-15 09:37:06,630 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-253:ctx-c4f59216) Done with process of VM state report. > host: 19 > 2015-09-15 09:37:07,506 DEBUG [c.c.c.ConsoleProxyManagerImpl] > (consoleproxy-1:ctx-90d7ef3a) Zone 1 is ready to launch console proxy > 2015-09-15 09:37:07,524 INFO [c.c.h.v.r.VmwareResource] > (DirectAgentCronJob-64:ctx-47298215) Scan hung worker VM to recycle > 2015-09-15 09:37:07,611 INFO [c.c.h.v.r.VmwareResource] > (DirectAgentCronJob-254:ctx-757ec6a9) Scan hung worker VM to recycle > 2015-09-15 09:37:07,722 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgentCronJob-64:ctx-47298215) Ping from > 25(cu01-testpod01-esx02.stxt.media.int) > 2015-09-15 09:37:07,722 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-64:ctx-47298215) Process host VM state report from ping > process. host: 25 > 2015-09-15 09:37:07,722 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-64:ctx-47298215) Process VM state report. host: 25, > number of records in report: 0 > 2015-09-15 09:37:07,723 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-64:ctx-47298215) Done with process of VM state report. > host: 25 > 2015-09-15 09:37:07,732 DEBUG [o.a.c.s.SecondaryStorageManagerImpl] > (secstorage-1:ctx-d63339a0) Zone 1 is ready to launch secondary storage VM > 2015-09-15 09:37:07,774 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgentCronJob-254:ctx-757ec6a9) Ping from > 20(cu01-testpod01-esx04.stxt.media.int) > 2015-09-15 09:37:07,774 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) Process host VM state report from ping > process. host: 20 > 2015-09-15 09:37:07,806 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) Process VM state report. host: 20, > number of records in report: 11 > 2015-09-15 09:37:07,806 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM state report. host: 20, vm id: 356, > power state: PowerOn > 2015-09-15 09:37:07,809 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM power state does not change, skip DB > writing. vm id: 356 > 2015-09-15 09:37:07,809 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM state report. host: 20, vm id: 342, > power state: PowerOn > 2015-09-15 09:37:07,825 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM state report is updated. host: 20, > vm id: 342, power state: PowerOn > 2015-09-15 09:37:07,832 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM r-342-VM is at Stopped and we > received a power-on report while there is no pending jobs on it > 2015-09-15 09:37:07,845 INFO [c.c.n.r.VirtualNetworkApplianceManagerImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) Schedule a router reboot task as router > 342 is powered-on out-of-band. we need to reboot to refresh network rules > 2015-09-15 09:37:07,846 INFO [c.c.n.r.VirtualNetworkApplianceManagerImpl] > (RouterMonitor-1:ctx-c118d628) Reboot router 342 to refresh network rules > 2015-09-15 09:37:07,847 DEBUG [c.c.c.CapacityManagerImpl] > (DirectAgentCronJob-254:ctx-757ec6a9) VM state transitted from :Stopped to > Running with event: FollowAgentPowerOnReportvm's original host id: 20 new > host id: 20 host id before state transition: null > {code} > h2. user vm log for i-2-355-VM > {code:none} > 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Run missing VM report. current time: > 1442244906525 > 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Detected missing VM. host: 19, vm id: > 355, power state: PowerReportMissing, last state update: 1441982337000 > 2015-09-14 17:35:06,525 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) vm id: 355 - time since last state > update(262569525ms) has passed graceful period > 2015-09-14 17:35:06,534 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) VM state report is updated. host: 19, vm > id: 355, power state: PowerReportMissing > 2015-09-14 17:35:06,540 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) VM i-2-355-VM is at Running and we > received a power-off report while there is no pending jobs on it > 2015-09-14 17:35:06,546 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-36:ctx-08ae2caf) Seq 19-4511199451741683988: Sending { > Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int), > Ver: v1, Flags: 100011, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0}}] > } > 2015-09-14 17:35:06,546 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-36:ctx-08ae2caf) Seq 19-4511199451741683988: Executing: > { Cmd , MgmtId: 345051122106, via: 19(cu01-testpod01-esx03.stxt.media.int), > Ver: v1, Flags: 100011, > [{"com.cloud.agent.api.StopCommand":{"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0}}] > } > 2015-09-14 17:35:06,546 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgent-424:ctx-4793889d) Seq 19-4511199451741683988: Executing request > 2015-09-14 17:35:06,547 INFO [c.c.h.v.r.VmwareResource] > (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) Executing resource StopCommand: > {"isProxy":false,"executeInSequence":false,"checkBeforeCleanup":true,"vmName":"i-2-355-VM","wait":0} > 2015-09-14 17:35:06,565 DEBUG [c.c.h.v.m.HostMO] > (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) find VM i-2-355-VM on host > 2015-09-14 17:35:06,565 INFO [c.c.h.v.m.HostMO] > (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) VM i-2-355-VM not found in host cache > 2015-09-14 17:35:06,565 DEBUG [c.c.h.v.m.HostMO] > (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) load VM cache on host > 2015-09-14 17:35:06,595 INFO [c.c.h.v.r.VmwareResource] > (DirectAgent-424:ctx-4793889d cu01-testpod01-esx03.stxt.media.int, cmd: > StopCommand) VM i-2-355-VM is no longer in vSphere > 2015-09-14 17:35:06,595 DEBUG [c.c.a.m.DirectAgentAttache] > (DirectAgent-424:ctx-4793889d) Seq 19-4511199451741683988: Response Received: > 2015-09-14 17:35:06,595 DEBUG [c.c.a.t.Request] > (DirectAgent-424:ctx-4793889d) Seq 19-4511199451741683988: Processing: { > Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, > [{"com.cloud.agent.api.StopAnswer":{"result":true,"details":"VM i-2-355-VM is > no longer in vSphere","wait":0}}] } > 2015-09-14 17:35:06,595 DEBUG [c.c.a.t.Request] > (DirectAgentCronJob-36:ctx-08ae2caf) Seq 19-4511199451741683988: Received: { > Ans: , MgmtId: 345051122106, via: 19, Ver: v1, Flags: 10, { StopAnswer } } > 2015-09-14 17:35:06,666 DEBUG [c.c.n.NetworkModelImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Service SecurityGroup is not supported > in the network id=211 > 2015-09-14 17:35:06,672 DEBUG [c.c.n.NetworkModelImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Service SecurityGroup is not supported > in the network id=211 > 2015-09-14 17:35:06,672 DEBUG [c.c.c.CapacityManagerImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) VM state transitted from :Running to > Stopped with event: FollowAgentPowerOffReportvm's original host id: 19 new > host id: null host id before state transition: 19 > 2015-09-14 17:35:06,689 WARN [o.a.c.alerts] > (DirectAgentCronJob-36:ctx-08ae2caf) alertType:: 27 // dataCenterId:: 1 // > podId:: 1 // clusterId:: null // message:: VM state sync alert > 2015-09-14 17:35:06,692 DEBUG [c.c.a.AlertManagerImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Have already sent: 1 emails for alert > type '27' -- skipping send email > 2015-09-14 17:35:06,692 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) VM i-2-355-VM is sync-ed to at Stopped > state according to power-off report from hypervisor > 2015-09-14 17:35:06,692 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-36:ctx-08ae2caf) Done with process of VM state report. > host: 19 > ... > 2015-09-14 17:35:07,754 DEBUG [c.c.v.VirtualMachinePowerStateSyncImpl] > (DirectAgentCronJob-137:ctx-80d2695e) VM state report is updated. host: 20, > vm id: 355, power state: PowerOn > 2015-09-14 17:35:07,762 INFO [c.c.v.VirtualMachineManagerImpl] > (DirectAgentCronJob-137:ctx-80d2695e) VM i-2-355-VM is at Stopped and we > received a power-on report while there is no pending jobs on it > 2015-09-14 17:35:07,793 DEBUG [c.c.c.CapacityManagerImpl] > (DirectAgentCronJob-137:ctx-80d2695e) VM state transitted from :Stopped to > Running with event: FollowAgentPowerOnReportvm's original host id: 19 new > host id: 20 host id before state transition: nul > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)