Re: Recover VM after KVM host down (and HA not working) ?
Hmm, could this be the culprit?

WARN [c.c.h.KVMInvestigator] (AgentTaskPool-10:ctx-694feb6c) (logid:160220c5) Agent investigation was requested on host Host[-4-Routing], but host does not support investigation because it has no NFS storage. Skipping investigation.

The primary storage is NFS.

On Sat, Dec 23, 2017 at 10:14 AM, Jean-Francois Nadeau <the.jfnad...@gmail.com> wrote:

> Clearly the management server doesn't realize the instance on the failed host is not running... but the host is in Alert state, powered down, and missing NFS heartbeats.
>
> 2017-12-23 14:57:52,427 DEBUG [c.c.h.Status] (AgentTaskPool-10:ctx-694feb6c) (logid:160220c5) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 4, name = r62-i122-36-01.domain.com]
> 2017-12-23 14:58:24,487 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-66fbe484) (logid:1f53cd63) Found 1 VMs on host 4
> 2017-12-23 14:58:24,495 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-66fbe484) (logid:1f53cd63) Found 0 VM, not running on host 4
>
> Next step?
Re: Recover VM after KVM host down (and HA not working) ?
Clearly the management server doesn't realize the instance on the failed host is not running... but the host is in Alert state, powered down, and missing NFS heartbeats.

2017-12-23 14:57:52,427 DEBUG [c.c.h.Status] (AgentTaskPool-10:ctx-694feb6c) (logid:160220c5) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 4, name = r62-i122-36-01.domain.com]
2017-12-23 14:58:24,487 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-66fbe484) (logid:1f53cd63) Found 1 VMs on host 4
2017-12-23 14:58:24,495 DEBUG [c.c.c.CapacityManagerImpl] (CapacityChecker:ctx-66fbe484) (logid:1f53cd63) Found 0 VM, not running on host 4

Next step?

On Sat, Dec 23, 2017 at 9:49 AM, Jean-Francois Nadeau <the.jfnad...@gmail.com> wrote:

> I'd really like to get to the bottom of this. It does sound like the behavior mentioned in https://issues.apache.org/jira/browse/CLOUDSTACK-5582, but that should be long fixed.
>
> One suspect log entry (possibly unrelated) I noticed is this recurring exception in the manager logs:
>
> ERROR [c.c.v.UserVmManagerImpl] (UserVm-ipfetch-3:ctx-d4c44c2b) (logid:16dd70ad) Caught the Exception in VmIpFetchTask
>
> Which I guess is caused by the use of an external DHCP, so the manager fails to determine a running VM's IP. Which brings me to my next question: how is a VM marked for HA actually monitored?
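When comparing entries like the ones above, it can help to split each management-server log line into fields and group them by logid, so that the AgentDisconnected transition and the investigator warning that share logid 160220c5 show up together. What follows is only a minimal sketch in Python: the regular expression is inferred from the handful of lines quoted in this thread, not from the CloudStack logging configuration, and the names `LOG_LINE`, `parse_log_line` and `group_by_logid` are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official format). The timestamp is optional because some lines
# were pasted without it.
LOG_LINE = re.compile(
    r"^(?:(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+)?"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"\((?P<thread>[^:()]+):(?P<ctx>[^)]*)\)\s+"
    r"\(logid:(?P<logid>[0-9a-f]+)\)\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields of one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def group_by_logid(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group parsed entries by logid, skipping lines that do not parse."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        entry = parse_log_line(line)
        if entry is not None:
            grouped[entry["logid"]].append(entry)
    return dict(grouped)
```

Feeding it the lines pasted in this thread would put the `c.c.h.Status` transition and the `KVMInvestigator` warning in the same group, which makes it easier to see what the management server did right after it marked the host as disconnected.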
Re: Recover VM after KVM host down (and HA not working) ?
I'd really like to get to the bottom of this. It does sound like the behavior mentioned in https://issues.apache.org/jira/browse/CLOUDSTACK-5582, but that should be long fixed.

One suspect log entry (possibly unrelated) I noticed is this recurring exception in the manager logs:

ERROR [c.c.v.UserVmManagerImpl] (UserVm-ipfetch-3:ctx-d4c44c2b) (logid:16dd70ad) Caught the Exception in VmIpFetchTask

Which I guess is caused by the use of an external DHCP, so the manager fails to determine a running VM's IP. Which brings me to my next question: how is a VM marked for HA actually monitored?

On Sat, Dec 23, 2017 at 3:38 AM, Eric Green wrote:

> If all else fails, change its state to the correct state in the MySQL database and restart the management service. Sadly that is the only way I could do it when my CloudStack got confused and stuck an instance in an intermediate state where I couldn't do anything with it.
Re: Recover VM after KVM host down (and HA not working) ?
If all else fails, change its state to the correct state in the MySQL database and restart the management service. Sadly that is the only way I could do it when my CloudStack got confused and stuck an instance in an intermediate state where I couldn't do anything with it.

On Dec 22, 2017 at 9:09 AM, <the.jfnad...@gmail.com> wrote:

> Good morning,
>
> New to ACS and doing a POC with 4.10 on CentOS 7 and KVM.
>
> I'm trying to recover VMs after a host failure (powered off from OOB).
>
> Primary storage is NFS and IPMI is configured for the KVM hosts. The zone is in advanced mode with VLAN separation, and I created a shared network with no services since I wish to use an external DHCP.
>
> First, say I don't have a compute offering with HA enabled and a KVM host goes down... I can't put it in maintenance mode while it is down, and disabling it has no effect on the state of the lost VMs. The VMs stay in Running state according to the manager. What should I do to force a restart on the remaining healthy hosts?
>
> Then I enabled IPMI on all KVM hosts and attempted the same experiment with a compute offering with HA enabled. Same result. The manager does see the host as disconnected and powered off but takes no action. I'm certainly missing something here. Please help!
>
> Regards,
>
> Jean-Francois
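For readers wondering what the manual state fix suggested above might look like, here is a rough sketch of the idea in Python. It runs against an in-memory SQLite stand-in so it can be tried safely. The table name `vm_instance`, its columns, and the state strings are assumptions for illustration only; the real CloudStack schema and the valid state names should be checked (and the database backed up) before touching a production system. The function `set_vm_state` is hypothetical and not part of CloudStack.

```python
import sqlite3


def set_vm_state(conn, vm_name, expected_state, new_state):
    """Move one VM from expected_state to new_state.

    The WHERE clause includes the expected current state, so nothing is
    changed if the record has moved on in the meantime. Returns the
    number of rows updated (0 or 1 when names are unique).
    """
    cur = conn.execute(
        "UPDATE vm_instance SET state = ? WHERE name = ? AND state = ?",
        (new_state, vm_name, expected_state),
    )
    conn.commit()
    return cur.rowcount


def demo():
    # Stand-in table; the column set is an assumption, not the real schema.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vm_instance ("
        "id INTEGER PRIMARY KEY, name TEXT, state TEXT, host_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO vm_instance (name, state, host_id) VALUES (?, ?, ?)",
        [("vm-a", "Running", 4), ("vm-b", "Running", 5)],
    )
    changed = set_vm_state(conn, "vm-a", "Running", "Stopped")
    rows = conn.execute(
        "SELECT name, state FROM vm_instance ORDER BY name"
    ).fetchall()
    conn.close()
    return changed, rows
```

Against a real deployment the same guarded UPDATE would be issued through a MySQL client instead, with the management service stopped or restarted afterwards as described above.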
Recover VM after KVM host down (and HA not working) ?
Good morning,

New to ACS and doing a POC with 4.10 on CentOS 7 and KVM.

I'm trying to recover VMs after a host failure (powered off from OOB).

Primary storage is NFS and IPMI is configured for the KVM hosts. The zone is in advanced mode with VLAN separation, and I created a shared network with no services since I wish to use an external DHCP.

First, say I don't have a compute offering with HA enabled and a KVM host goes down... I can't put it in maintenance mode while it is down, and disabling it has no effect on the state of the lost VMs. The VMs stay in Running state according to the manager. What should I do to force a restart on the remaining healthy hosts?

Then I enabled IPMI on all KVM hosts and attempted the same experiment with a compute offering with HA enabled. Same result. The manager does see the host as disconnected and powered off but takes no action. I'm certainly missing something here. Please help!

Regards,

Jean-Francois