[ https://issues.apache.org/jira/browse/CLOUDSTACK-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nux closed CLOUDSTACK-10234.
----------------------------
Resolution: Later
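This issue needs a more thorough rethink. Restarting a host's VMs elsewhere without first confirming that the host is really powered off could lead to data corruption.
For now, Host HA works only as long as the hosts' IPMI interfaces are reachable.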
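The trade-off behind this resolution can be illustrated with a fence-before-recover guard. HA restarts a failed host's VMs elsewhere only after an out-of-band power-off is confirmed. If the BMC cannot be reached, HA leaves the decision to an administrator, because the host might still be running and writing to shared storage. This is a minimal, hypothetical sketch. `FenceGuard`, `PowerController`, `FenceResult` and `Decision` are illustrative names, not CloudStack classes.

```java
/**
 * Hypothetical sketch of a fence-before-recover rule for host HA.
 * None of these names are CloudStack APIs.
 */
public class FenceGuard {

    /** Outcome of an attempt to power a host off through its BMC (hypothetical). */
    public enum FenceResult { CONFIRMED_OFF, UNREACHABLE, FAILED }

    /** What HA should do next (hypothetical). */
    public enum Decision { RECOVER_VMS, WAIT_FOR_ADMIN }

    /** Hypothetical abstraction over an out-of-band (e.g. IPMI) power-off call. */
    public interface PowerController {
        FenceResult powerOff(String hostId);
    }

    private final PowerController oobm;

    public FenceGuard(PowerController oobm) {
        this.oobm = oobm;
    }

    /**
     * Recover VMs only when the power-off is confirmed. An unreachable BMC
     * (as when the whole server, BMC included, loses power) proves nothing
     * about whether the host is still running, so recovery is withheld.
     */
    public Decision decide(String hostId) {
        FenceResult result = oobm.powerOff(hostId);
        return result == FenceResult.CONFIRMED_OFF
                ? Decision.RECOVER_VMS
                : Decision.WAIT_FOR_ADMIN;
    }

    public static void main(String[] args) {
        FenceGuard bmcDown = new FenceGuard(host -> FenceResult.UNREACHABLE);
        FenceGuard bmcUp = new FenceGuard(host -> FenceResult.CONFIRMED_OFF);
        System.out.println(bmcDown.decide("hv01")); // prints WAIT_FOR_ADMIN
        System.out.println(bmcUp.decide("hv01"));   // prints RECOVER_VMS
    }
}
```

The guard errs on the side of not recovering. A stuck host needing manual action is considered less harmful than two copies of the same VM writing to one disk.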
> HA fails in cases of PSU failure.
> ---------------------------------
>
> Key: CLOUDSTACK-10234
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10234
> Project: CloudStack
> Issue Type: Improvement
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.11.0.0
> Environment: 4.11 RC1, NFS storage, CentOS 7 management server and hypervisors
> Reporter: Nux
> Assignee: Rohit Yadav
> Priority: Major
> Labels: HA, KVM
>
> To simulate a PSU failure, I physically pulled the power from the server. HA
> failed to do the right thing and move the affected VMs to other HVs.
> I waited a good while, but alas, nothing happened. The VM and VR running on
> the affected hypervisor were never moved to another one (I have two other
> hypervisors running).
> Is there any way to at least force the system to mark that HV as bad/offline?
> This is what I see in the management server logs:
> {code:java}
> Caused by: com.cloud.utils.exception.CloudRuntimeException: Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
>     at org.apache.cloudstack.outofbandmanagement.OutOfBandManagementServiceImpl.executePowerOperation(OutOfBandManagementServiceImpl.java:423)
>     at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     ... 21 more
> 2018-01-16 17:00:13,396 WARN [o.a.c.alerts] (pool-5-thread-7:null) (logid:4f7299f6) AlertType:: 30 | dataCenterId:: 1 | podId:: 1 | clusterId:: null | message:: HA Fencing of host id=1, in dc id=1 performed
> 2018-01-16 17:00:15,375 DEBUG [c.c.a.t.Request] (pool-2-thread-27:null) (logid:6b21a8c1) Seq 5-9115285645797884785: Sending { Cmd , MgmtId: 161334379813, via: 5(hv03.cloud.local), Ver: v1, Flags: 100011,
> [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"598d48ef-158d-3e14-ad68-8d02c9368ddf-LibvirtComputingResource","privateNetwork":{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"172.16.25.101","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:f6","isSecurityGroupEnabled":false}},"wait":20}}]
> }
> 2018-01-16 17:00:15,380 DEBUG [c.c.a.t.Request] (pool-2-thread-5:null) (logid:bb993597) Seq 4-6582855280332112812: Sending { Cmd , MgmtId: 161334379813, via: 4(hv02.cloud.local), Ver: v1, Flags: 100011,
> [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"6ebb3010-9c49-3a9c-b620-ecbc9731aca2-LibvirtComputingResource","privateNetwork":{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false},"publicNetwork":{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"172.16.25.100","netmask":"255.255.255.240","mac":"0c:c4:7a:40:8e:8e","isSecurityGroupEnabled":false}},"wait":20}}]
> }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.t.Request] (AgentManager-Handler-4:null) (logid:) Seq 5-9115285645797884785: Processing: { Ans: , MgmtId: 161334379813, via: 5, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"Heart is beating...","wait":0}}] }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.t.Request] (pool-2-thread-27:null) (logid:6b21a8c1) Seq 5-9115285645797884785: Received: { Ans: , MgmtId: 161334379813, via: 5(hv03.cloud.local), Ver: v1, Flags: 10, { Answer } }
> 2018-01-16 17:00:15,423 DEBUG [c.c.a.m.AgentManagerImpl] (pool-2-thread-27:null) (logid:6b21a8c1) Details from executing class com.cloud.agent.api.CheckOnHostCommand: Heart is beating...
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.t.Request] (AgentManager-Handler-6:null) (logid:) Seq 4-6582855280332112812: Processing: { Ans: , MgmtId: 161334379813, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"Heart is beating...","wait":0}}] }
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.t.Request] (pool-2-thread-5:null) (logid:bb993597) Seq 4-6582855280332112812: Received: { Ans: , MgmtId: 161334379813, via: 4(hv02.cloud.local), Ver: v1, Flags: 10, { Answer } }
> 2018-01-16 17:00:15,427 DEBUG [c.c.a.m.AgentManagerImpl] (pool-2-thread-5:null) (logid:bb993597) Details from executing class com.cloud.agent.api.CheckOnHostCommand: Heart is beating...
> 2018-01-16 17:00:16,217 INFO [o.a.c.f.j.i.AsyncJobManagerImpl] (AsyncJobMgr-Heartbeat-1:ctx-d9c2c841) (logid:1b093681) Begin cleanup expired async-jobs
> 2018-01-16 17:00:16,218 INFO [o.a.c.f.j.i.AsyncJobManagerImpl] (AsyncJobMgr-Heartbeat-1:ctx-d9c2c841) (logid:1b093681) End cleanup expired async-jobs
> 2018-01-16 17:00:17,392 WARN [o.a.c.o.PowerOperationTask] (pool-6-thread-29:null) (logid:f9788c38) Out-of-band management background task operation=STATUS for host id=1 failed with: Out-of-band Management action (STATUS) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,422 DEBUG [o.a.c.o.OutOfBandManagementServiceImpl] (pool-5-thread-6:ctx-65225bcc) (logid:665de20f) Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,438 WARN [o.a.c.k.h.KVMHAProvider] (pool-5-thread-6:ctx-65225bcc) (logid:665de20f) OOBM service is not configured or enabled for this host hv01.cloud.local error is Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
> 2018-01-16 17:00:17,438 WARN [o.a.c.h.t.BaseHATask] (pool-5-thread-9:null) (logid:ff44841a) Exception occurred while running FenceTask on a resource: org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host hv01.cloud.local
> org.apache.cloudstack.ha.provider.HAFenceException: OOBM service is not configured or enabled for this host hv01.cloud.local
>     at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:99)
>     at org.apache.cloudstack.kvm.ha.KVMHAProvider.fence(KVMHAProvider.java:42)
>     at org.apache.cloudstack.ha.task.FenceTask.performAction(FenceTask.java:42)
>     at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:86)
>     at org.apache.cloudstack.ha.task.BaseHATask$1.call(BaseHATask.java:83)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: com.cloud.utils.exception.CloudRuntimeException: Out-of-band Management action (OFF) on host (57bf86e0-e1cd-484e-a4f1-78b3ca2da125) failed with error: Get Auth Capabilities error Error issuing Get Channel Authentication Capabilities request Error: Unable to establish IPMI v2 / RMCP+ session
>     at org.apache.cloudstack.outofbandmanagement.OutOfBandManagementServiceImpl.executePowerOperation(OutOfBandManagementServiceImpl.java:423)
>     at sun.reflect.GeneratedMethodAccessor199.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     ... 21 more
> 2018-01-16 17:00:17,439 WARN [o.a.c.alerts] (pool-5-thread-9:null) (logid:ff44841a) AlertType:: 30 | dataCenterId:: 1 | podId:: 1 | clusterId:: null | message:: HA Fencing of host id=1, in dc id=1 performed
> 2018-01-16 17:00:17,903 DEBUG [o.a.c.s.SecondaryStorageManagerImpl] (secstorage-1:ctx-ccb33721) (logid:722404aa) Zone 1 is ready to launch secondary storage VM
> 2018-01-16 17:00:17,935 DEBUG [c.c.c.ConsoleProxyManagerImpl] (consoleproxy-1:ctx-22a69a02) (logid:393fab21) Zone 1 is ready to launch console proxy
> {code}
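The log above repeatedly fails with "Unable to establish IPMI v2 / RMCP+ session". This is consistent with the BMC having lost power together with the host: pulling the power cord usually removes the BMC's standby power as well. Whether a BMC is "within reach" at all can be probed before relying on Host HA. This is a minimal sketch that assumes the BMC answers the standard RMCP/ASF Presence Ping on UDP port 623; many BMCs do, but not all. `BmcPing` and its methods are hypothetical helpers, not part of CloudStack. A reply shows only basic network reachability. It does not show that RMCP+ authentication will succeed.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;

/** Hypothetical helper: checks whether a BMC answers an ASF Presence Ping. */
public class BmcPing {

    static final int RMCP_PORT = 623; // standard RMCP / IPMI-over-LAN UDP port

    /** Builds the 12-byte RMCP header plus ASF Presence Ping message. */
    static byte[] presencePing(byte tag) {
        return new byte[] {
            0x06, 0x00, (byte) 0xFF, 0x06,  // RMCP v1.0, reserved, seq 0xFF (no ACK), class ASF
            0x00, 0x00, 0x11, (byte) 0xBE,  // ASF IANA enterprise number 4542
            (byte) 0x80, tag, 0x00, 0x00    // Presence Ping, message tag, reserved, data length 0
        };
    }

    /** True if the datagram looks like an ASF Presence Pong (message type 0x40). */
    static boolean isPresencePong(byte[] buf, int len) {
        return len >= 12 && buf[3] == 0x06 && (buf[8] & 0xFF) == 0x40;
    }

    /** Sends one ping and waits up to timeoutMs for a pong. */
    static boolean bmcResponds(String host, int timeoutMs) throws IOException {
        byte[] ping = presencePing((byte) 0x01);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMs);
            InetAddress addr = InetAddress.getByName(host);
            socket.send(new DatagramPacket(ping, ping.length, addr, RMCP_PORT));
            byte[] buf = new byte[64];
            DatagramPacket reply = new DatagramPacket(buf, buf.length);
            try {
                socket.receive(reply);
            } catch (SocketTimeoutException e) {
                return false; // no answer: BMC unpowered, unreachable, or not ASF-capable
            }
            return isPresencePong(buf, reply.getLength());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the BMC address as the first argument.
        String bmc = args.length > 0 ? args[0] : "127.0.0.1";
        System.out.println(bmc + " BMC responds: " + bmcResponds(bmc, 2000));
    }
}
```

A timeout here means the BMC cannot vouch for the host's power state. In that situation a fence-before-recover policy would hold off on restarting the host's VMs.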
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)