[URGENT] cloudstack SSVM and router failed to start after power blackout
OK, this is my problem: after a power blackout I can't start the virtual router, and the SSVM is not detected in my CloudStack system. The SSVM recreated itself but is stuck in the Starting state. What should I do? Please help me.
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
Hi, SSVMs and VRs are stateless, so if restarts are not working for you, you may (force) stop and remove them. The CloudStack HA thread(s) will kickstart new ones after a certain timeout; to speed this up you may restart CloudStack as well. If your problem still persists after trying the above, you can debug the issue further: https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM,+templates,+Secondary+storage+troubleshooting Regards.
On Thu, Jun 19, 2014 at 7:34 PM, dimas yoga pratama smid...@gmail.com wrote: OK this is my problem, after blackout I can't start virtual router, and ssvm not detected in my cloudstack system. SSVM recreated itself but stuck in starting state. What should I do? Please help me..
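For anyone who prefers the API to the UI for the force stop suggested above, here is a minimal sketch of building a signed CloudStack API request in Python. It assumes the usual CloudStack signing scheme (URL-encode the values, sort by parameter name, lowercase the string, HMAC-SHA1 with the secret key, Base64) and the `stopSystemVm` command with its `forced` parameter; the endpoint, keys and VM id are placeholders, not values from this thread. Please check it against the API reference for your CloudStack version before relying on it.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def signed_url(endpoint, params, api_key, secret_key):
    """Build a signed CloudStack API URL (HMAC-SHA1 over the sorted, lowercased query)."""
    query = dict(params, apikey=api_key, response="json")
    # URL-encode each value and sort the pairs by parameter name.
    canonical = "&".join(
        f"{key}={quote(str(value), safe='')}"
        for key, value in sorted(query.items(), key=lambda kv: kv[0].lower())
    )
    # The signature is computed over the lowercased query string.
    digest = hmac.new(
        secret_key.encode(), canonical.lower().encode(), hashlib.sha1
    ).digest()
    signature = base64.b64encode(digest).decode()
    return f"{endpoint}?{canonical}&signature={quote(signature, safe='')}"


# Placeholder values (assumptions): replace with your own endpoint, keys and VM id.
ENDPOINT = "http://mgmt.example.invalid:8080/client/api"
PARAMS = {"command": "stopSystemVm", "id": "SSVM-UUID-HERE", "forced": "true"}

url = signed_url(ENDPOINT, PARAMS, "API_KEY", "SECRET_KEY")
print(url)
```

The sketch only builds the URL; sending it (for example with `urllib.request.urlopen`) is left out on purpose so nothing is stopped by accident.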
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
Hi, from the Infrastructure tab I can see the hosts, but both hosts show the Alert state. I already tried to force reconnect, but it fails. What should I do? Now the CPVM fails to start too.
On Thu, Jun 19, 2014 at 9:17 PM, Rohit Yadav bhais...@apache.org wrote: Hi, SSVMs and VRs are stateless so if restarts are not working for you, you may (force) stop and remove them. The CloudStack HA thread(s) would kickstart new ones after a certain timeout, to speed this behaviour you may restart CloudStack as well. If your problem still persists after trying above you may try debugging the issue: https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM,+templates,+Secondary+storage+troubleshooting Regards.
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
I'm not sure what the specific issue could be. You can tail the management server logs to see what is failing. Once you have found the specific failure, share it with us along with your host OS, CloudStack version and details of the connected hosts. Regards.
On Thu, Jun 19, 2014 at 8:01 PM, dimas yoga pratama smid...@gmail.com wrote: Hi, from the infrastructure tab I can detect the hosts, but both of the hosts show alert state., I already try to force reconnect but it fails. What should I do? Now the CPVM fail to start too.
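When the management server log is noisy, it can help to filter it down to WARN and ERROR entries before reading. The following is a rough Python sketch; the line layout it expects (timestamp, level, logger in square brackets, message) is inferred from the log excerpt posted later in this thread, and the helper name `problems` is made up for illustration.

```python
import re

# Layout assumed from the excerpt in this thread: "<date> <time> <LEVEL> [<logger>] <message>".
LOG_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+\[([^\]]+)\] (.*)$"
)


def problems(lines, levels=("WARN", "ERROR")):
    """Yield (timestamp, level, logger, message) for entries at the given levels."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group(2) in levels:
            yield match.groups()


sample = [
    "2014-06-19 21:49:11,585 DEBUG [c.c.v.VirtualMachineManagerImpl] "
    "(AgentConnectTaskPool-344:ctx-07431045) Ignoring VM in starting mode: r-71-VM",
    "2014-06-19 21:49:21,312 WARN [c.c.u.n.Link] (AgentManager-Selector:null) "
    "SSL: Fail to find the generated keystore. Loading fail-safe one to continue.",
]

results = list(problems(sample))
for timestamp, level, logger, message in results:
    print(timestamp, level, logger, message)
```

For a real log you would pass an open file object instead of `sample`; stack trace lines do not match the pattern and are skipped, so look them up in the full log once you know the timestamp.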
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
management log:
2014-06-19 21:49:21,312 WARN [c.c.u.n.Link] (AgentManager-Selector:null) SSL: Fail to find the generated keystore. Loading fail-safe one to continue.
2014-06-19 21:49:11,585 DEBUG [c.c.v.VirtualMachineManagerImpl] (AgentConnectTaskPool-344:ctx-07431045) Ignoring VM in starting mode: r-71-VM
2014-06-19 21:49:11,585 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentConnectTaskPool-344:ctx-07431045) VM does not require investigation so I'm marking it as Stopped: VM[DomainRouter|r-71-VM]
2014-06-19 21:49:11,585 WARN [o.a.c.f.j.AsyncJobExecutionContext] (AgentConnectTaskPool-344:ctx-07431045) Job is executed without a context, setup psudo job for the executing thread
2014-06-19 21:49:11,651 DEBUG [c.c.v.VirtualMachineManagerImpl] (AgentConnectTaskPool-344:ctx-07431045) Unable to transition the state but we're moving on because it's forced stop
2014-06-19 21:49:11,651 DEBUG [c.c.v.VirtualMachineManagerImpl] (AgentConnectTaskPool-344:ctx-07431045) Unable to cleanup VM: VM[DomainRouter|r-71-VM] ,since outstanding work item is not found
2014-06-19 21:49:11,651 ERROR [c.c.a.m.AgentManagerImpl] (AgentConnectTaskPool-344:ctx-07431045) Monitor ClusteredVirtualMachineManagerImpl says there is an error in the connect process for 2 due to Work item not found, We cannot stop VM[DomainRouter|r-71-VM] when it is in state Starting
com.cloud.utils.exception.CloudRuntimeException: Work item not found, We cannot stop VM[DomainRouter|r-71-VM] when it is in state Starting
    at com.cloud.vm.VirtualMachineManagerImpl.advanceStop(VirtualMachineManagerImpl.java:1415)
    at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStop(VirtualMachineManagerImpl.java:1344)
    at com.cloud.vm.VirtualMachineManagerImpl.advanceStop(VirtualMachineManagerImpl.java:1312)
    at com.cloud.ha.HighAvailabilityManagerImpl.scheduleRestart(HighAvailabilityManagerImpl.java:346)
    at com.cloud.vm.VirtualMachineManagerImpl.compareState(VirtualMachineManagerImpl.java:2827)
    at com.cloud.vm.VirtualMachineManagerImpl.fullHostSync(VirtualMachineManagerImpl.java:2384)
    at com.cloud.vm.VirtualMachineManagerImpl.processConnect(VirtualMachineManagerImpl.java:3035)
    at com.cloud.agent.manager.AgentManagerImpl.notifyMonitorsOfConnection(AgentManagerImpl.java:495)
    at com.cloud.agent.manager.AgentManagerImpl.handleConnectedAgent(AgentManagerImpl.java:999)
    at com.cloud.agent.manager.AgentManagerImpl.access$000(AgentManagerImpl.java:117)
    at com.cloud.agent.manager.AgentManagerImpl$HandleAgentConnectTask.runInContext(AgentManagerImpl.java:1082)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2014-06-19 21:49:08,251 DEBUG [c.c.s.s.SecondaryStorageManagerImpl] (secstorage-1:ctx-c407a559) Zone 1 is not ready to launch secondary storage VM yet
2014-06-19 21:49:08,282 DEBUG [c.c.c.ConsoleProxyManagerImpl] (consoleproxy-1:ctx-60f4b3c9) Zone 1 is not ready to launch console proxy yet
2014-06-19 21:49:01,197 DEBUG [c.c.n.NetworkUsageManagerImpl] (AgentConnectTaskPool-342:ctx-b66d3294) Disconnected called on 2 with status Alert
2014-06-19 21:49:01,197 DEBUG [c.c.a.m.AgentManagerImpl] (AgentConnectTaskPool-342:ctx-b66d3294) Sending Disconnect to listener: com.cloud.consoleproxy.ConsoleProxyListener
2014-06-19 21:49:01,198 DEBUG [c.c.h.Status] (AgentConnectTaskPool-342:ctx-b66d3294) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 2, name = host1.cloud.priv]
2014-06-19 21:49:01,259 DEBUG [c.c.h.Status] (AgentConnectTaskPool-342:ctx-b66d3294) Agent status update: [id = 2; name = host1.cloud.priv; old status = Connecting; event = AgentDisconnected; new status = Alert; old update count = 404; new update count = 405]
2014-06-19 21:49:01,259 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (AgentConnectTaskPool-342:ctx-b66d3294) Notifying other nodes of to disconnect
2014-06-19 21:49:01,260 DEBUG [c.c.a.m.AgentManagerImpl] (AgentConnectTaskPool-342:ctx-b66d3294) Failed to handle host connection: com.cloud.utils.exception.CloudRuntimeException: Unable to connect 2
2014-06-19 21:49:01,261 DEBUG [c.c.a.m.AgentManagerImpl] (AgentConnectTaskPool-342:ctx-b66d3294) Can not send command com.cloud.agent.api.ReadyCommand due to Host 2 is
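The "Agent status update" entry in the log above is the one that shows the host going from Connecting to Alert. As a small illustration, the Python sketch below pulls the fields out of such an entry; the function name `parse_status_update` is made up for this example, and the field layout is taken only from the single entry in this log, so other CloudStack versions may format it differently.

```python
import re


def parse_status_update(line):
    """Return the key/value fields of an 'Agent status update: [...]' entry, or None."""
    match = re.search(r"Agent status update: \[(.*?)\]", line)
    if not match:
        return None
    fields = {}
    # Fields are separated by ';' and written as 'key = value'.
    for part in match.group(1).split(";"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


line = (
    "2014-06-19 21:49:01,259 DEBUG [c.c.h.Status] (AgentConnectTaskPool-342:ctx-b66d3294) "
    "Agent status update: [id = 2; name = host1.cloud.priv; old status = Connecting; "
    "event = AgentDisconnected; new status = Alert; old update count = 404; "
    "new update count = 405]"
)

info = parse_status_update(line)
print(info["name"], info["old status"], "->", info["new status"])
# prints: host1.cloud.priv Connecting -> Alert
```

Running this over the whole log gives a quick timeline of which host changed state and on which event.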
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
Hi Dimas, It looks like the VM is stuck in the Starting state and CloudStack is unable to contact the agent. I hope you've removed the VR from CloudStack using the UI. You can try restarting the management server. This is a sync issue, where one party (the mgmt server) has a different view of the world than the other (the host/agent). In such cases, do not remove the host: when you re-add it, CloudStack may destroy all the (user) VMs on it, or the re-add may simply fail. If restarting doesn't fix the problem, reduce the expunge timeout in global settings (that's when CloudStack marks a VM as removed; since you've just destroyed it, it can take some time to get expunged) and try again. As a final course of action, I would stop the management server, ssh to the host and destroy the SSVMs, then use the mysql client to change the DB entries for the SSVM to removed/expunged (mark them by updating the row; do not delete the row), start the mgmt server again and hope it works this time. Does anyone have other suggestions for such a case? Regards.
On Thu, Jun 19, 2014 at 8:29 PM, dimas yoga pratama smid...@gmail.com wrote: management log : 2014-06-19 21:49:21,312 WARN [c.c.u.n.Link] (AgentManager-Selector:null) SSL: Fail to find the generated keystore. Loading fail-safe one to continue.
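To make the "update the row, do not delete it" advice concrete, here is a small Python sketch that runs against a throwaway in-memory SQLite table, not against a real CloudStack database. The table name, column names, VM names and the `Expunging` state value are all assumptions chosen for illustration; inspect your actual schema and take a database backup before changing anything on a live system.

```python
import sqlite3
from datetime import datetime, timezone

# Throwaway in-memory table standing in for the real one (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vm_instance ("
    "id INTEGER PRIMARY KEY, name TEXT, type TEXT, state TEXT, removed TEXT)"
)
conn.executemany(
    "INSERT INTO vm_instance (name, type, state, removed) VALUES (?, ?, ?, NULL)",
    [("s-1-VM", "SecondaryStorageVm", "Starting"), ("i-2-3-VM", "User", "Running")],
)


def mark_removed(conn, name):
    """Soft-delete one VM row: set a removal timestamp instead of deleting the row."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    cursor = conn.execute(
        "UPDATE vm_instance SET state = 'Expunging', removed = ? "
        "WHERE name = ? AND removed IS NULL",
        (now, name),
    )
    conn.commit()
    return cursor.rowcount  # number of rows actually changed


changed = mark_removed(conn, "s-1-VM")
total = conn.execute("SELECT COUNT(*) FROM vm_instance").fetchone()[0]
print(changed, "row updated,", total, "rows still present")
```

The `removed IS NULL` condition keeps the update from touching rows that are already marked, and the row count stays the same, which is the point of marking rather than deleting.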
Re: [URGENT] cloudstack SSVM and router failed to start after power blackout
Thanks Rohit, I will give feedback as soon as I have tried it.
On Thu, Jun 19, 2014 at 10:38 PM, Rohit Yadav bhais...@apache.org wrote: Hi Dimas, Looks like the VM is in starting state and CloudStack is unable to contact the agent. Hope you've removed the VR from CloudStack using the UI. You can try restarting the management server. The issue is of sync, where one party (mgmt server) has different view of the world than the other (the host/agent). In such cases, do not remove the host else when you re-add it, it may destroy all the (user) VMs on it or simply fail. If restarting won't fix the problem, in global settings reduce the expunge timeout (that's when CloudStack marks a VM as removed, since you've just destroyed it, it can take some time to get expunged) and try again. As a final course of action I would stop the management server, then ssh to the host and destroy SSVMs, using mysql client I would change db entries for SSVM to removed/expunged (simply mark by updating row, do not remove the row), start the mgmt server again and hope it would work this time. Suggestions anyone in such a case? Regards.