[ https://issues.apache.org/jira/browse/CLOUDSTACK-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sangeetha Hariharan closed CLOUDSTACK-2428. ------------------------------------------- Resolution: Not A Problem Koushik, Thanks for the analysis. The NTP server that the xenservers are configure with is not really responding and I see a time lag between the 2 hosts. This could be the reason why there was a delay in the host being marked “Down”. I am closing this issue since the hosts status did get marked as “Down” eventually , even though these errors are seen in the management server logs. Thanks Sangeetha > HA - When the master host is disconnected , the host status contines to > remain in "Up" state because of > com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of > slave > ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: CLOUDSTACK-2428 > URL: https://issues.apache.org/jira/browse/CLOUDSTACK-2428 > Project: CloudStack > Issue Type: Bug > Security Level: Public(Anyone can view this level - this is the > default.) > Components: Management Server > Affects Versions: 4.2.0, 4.3.0 > Environment: Build from pvaln > Reporter: Sangeetha Hariharan > Assignee: Koushik Das > Priority: Critical > Fix For: 4.3.0, 4.2.0 > > Attachments: hostdown.rar, hostdown.rar, logs.rar, logs_7_29, > management-server.rar, masterhostdown.rar > > > 1. Advance zone with 1 cluster with 2 hosts. Create Shared network with > private vlan. > 2. Deploy few HA enabled Vms in this network. > 3. pull network cable for one of the host. > When cloudstack detects that the host is disconnected , it is not able to out > the host in disconnected state and start HA for Vms that are HA enabeld, > I see the following exception in the management server logs: > 2013-05-09 17:15:55,576 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-267:null) Seq 1-1435828229: Executing request > 2013-05-09 17:15:55,602 DEBUG [xen.resource.XenServerConnectionPool] > (DirectAgent-267:null) Catch Exception: > com.xensource.xenapi.Types$HostOffline Host is offline 10.223.81.62 due to > You attempted an operation which involves a host which could not be contacted. > 2013-05-09 17:15:55,603 DEBUG [xen.resource.XenServerConnectionPool] > (DirectAgent-267:null) Trying to reset master of slave 10.223.81.62 to > 10.223.81.61 > 2013-05-09 17:16:02,319 WARN [xen.resource.CitrixResourceBase] > (DirectAgent-265:null) can not ping xenserver > 520d4994-8b1f-4dda-b51d-2ee63750abf6 > 2013-05-09 17:16:02,319 WARN [agent.manager.DirectAgentAttache] > (DirectAgent-265:null) Unable to get current status on 1 > 2013-05-09 17:16:02,321 INFO [agent.manager.AgentManagerImpl] > (AgentTaskPool-11:null) Investigating why host 1 has disconnected with event > AgentDisconnected > 2013-05-09 17:16:02,321 DEBUG [agent.manager.AgentManagerImpl] > (AgentTaskPool-11:null) checking if agent (1) is alive > 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] > (AgentTaskPool-11:null) Seq 1-1435828294: Sending { Cmd , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 100011, > [{"CheckHealthCommand":{"wait":50}}] } > 2013-05-09 17:16:02,323 DEBUG [agent.transport.Request] > (AgentTaskPool-11:null) Seq 1-1435828294: Executing: { Cmd , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 100011, > [{"CheckHealthCommand":{"wait":50}}] } > 2013-05-09 17:16:02,323 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-271:null) Seq 1-1435828294: Executing request > 2013-05-09 17:16:04,035 DEBUG [agent.manager.AgentAttache] > (AgentTaskPool-10:null) Seq 6-474349576: Waiting some more time because this > is the current command > 2013-05-09 17:16:04,040 DEBUG [xen.resource.XenServerConnectionPool] > (DirectAgent-268:null) localLogout has problem Failed to read server's > response: connect timed out > 2013-05-09 17:16:04,040 WARN [agent.manager.DirectAgentAttache] > (DirectAgent-268:null) Seq 1-1435828292: Exception Caught while executing > command > com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of > slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: > Failed to read server's response: connect timed out > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443) > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524) > at > com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73) > at > com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102) > at > com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:679) > 2013-05-09 17:16:04,041 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-268:null) Seq 1-1435828292: Response Received: > 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] > (DirectAgent-268:null) Seq 1-1435828292: Processing: { Ans: , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 10, > [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: > Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to > org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect > timed out","wait":0}}] } > 2013-05-09 17:16:04,041 DEBUG [agent.transport.Request] > (AgentTaskPool-5:null) Seq 1-1435828292: Received: { Ans: , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } } > 2013-05-09 17:16:04,041 DEBUG [cloud.ha.AbstractInvestigatorImpl] > (AgentTaskPool-5:null) host (10.223.81.50) cannot be pinged, returning null > ('I don't know') > 2013-05-09 17:16:04,041 DEBUG [cloud.ha.UserVmDomRInvestigator] > (AgentTaskPool-5:null) sending ping from (5) to agent's host ip address > (10.223.81.50) > 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] > (AgentTaskPool-5:null) Seq 5-2082341067: Sending { Cmd , MgmtId: > 7647994577963, via: 5, Ver: v1, Flags: 100011, > [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] } > 2013-05-09 17:16:04,043 DEBUG [agent.transport.Request] > (AgentTaskPool-5:null) Seq 5-2082341067: Executing: { Cmd , MgmtId: > 7647994577963, via: 5, Ver: v1, Flags: 100011, > [{"PingTestCommand":{"_computingHostIp":"10.223.81.50","wait":20}}] } > 2013-05-09 17:16:04,043 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-272:null) Seq 5-2082341067: Executing request > 2013-05-09 17:16:04,053 DEBUG [xen.resource.XenServerConnectionPool] > (DirectAgent-91:null) localLogout has problem Failed to read server's > response: connect timed out > 2013-05-09 17:16:04,053 WARN [agent.manager.DirectAgentAttache] > (DirectAgent-91:null) Seq 1-1435828293: Exception Caught while executing > command > com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of > slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: > Failed to read server's response: connect timed out > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443) > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:1682) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:524) > at > com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73) > at > com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102) > at > com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) > at java.util.concurrent.FutureTask.run(FutureTask.java:166) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:165) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:679) > 2013-05-09 17:16:04,054 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-91:null) Seq 1-1435828293: Response Received: > 2013-05-09 17:16:04,054 DEBUG [agent.transport.Request] (DirectAgent-91:null) > Seq 1-1435828293: Processing: { Ans: , MgmtId: 7647994577963, via: 1, Ver: > v1, Flags: 10, > [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: > Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to > org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect > timed out","wait":0}}] } > 2013-05-09 17:16:04,055 DEBUG [agent.transport.Request] > (AgentTaskPool-7:null) Seq 1-1435828293: Received: { Ans: , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 10, { Answer } } > 2013-05-09 17:16:04,055 DEBUG [cloud.ha.AbstractInvestigatorImpl] > (AgentTaskPool-7:null) host (10.223.81.52) cannot be pinged, returning null > ('I don't know') > 2013-05-09 17:16:04,055 DEBUG [cloud.ha.UserVmDomRInvestigator] > (AgentTaskPool-7:null) sending ping from (5) to agent's host ip address > (10.223.81.52) > 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] > (AgentTaskPool-7:null) Seq 5-2082341068: Sending { Cmd , MgmtId: > 7647994577963, via: 5, Ver: v1, Flags: 100011, > [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] } > 2013-05-09 17:16:04,057 DEBUG [agent.manager.AgentAttache] > (AgentTaskPool-14:null) Seq 3-1752367195: Waiting some more time because this > is the current command > 2013-05-09 17:16:04,057 DEBUG [agent.transport.Request] > (AgentTaskPool-7:null) Seq 5-2082341068: Executing: { Cmd , MgmtId: > 7647994577963, via: 5, Ver: v1, Flags: 100011, > [{"PingTestCommand":{"_computingHostIp":"10.223.81.52","wait":20}}] } > 2013-05-09 17:16:04,057 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-91:null) Seq 5-2082341068: Executing request > 2013-05-09 17:16:05,175 DEBUG [storage.secondary.SecondaryStorageManagerImpl] > (secstorage-1:null) Zone 1 is ready to launch secondary storage VM > 2013-05-09 17:16:05,614 DEBUG [xen.resource.XenServerConnectionPool] > (DirectAgent-267:null) localLogout has problem Failed to read server's > response: connect timed out > 2013-05-09 17:16:05,614 WARN [agent.manager.DirectAgentAttache] > (DirectAgent-267:null) Seq 1-1435828229: Exception Caught while executing > command > com.cloud.utils.exception.CloudRuntimeException: Unable to reset master of > slave 10.223.81.62 to 10.223.81.61 due to org.apache.xmlrpc.XmlRpcException: > Failed to read server's response: connect timed out > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.PoolEmergencyResetMaster(XenServerConnectionPool.java:443) > at > com.cloud.hypervisor.xen.resource.XenServerConnectionPool.connect(XenServerConnectionPool.java:661) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.getConnection(CitrixResourceBase.java:5639) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.execute(CitrixResourceBase.java:7725) > at > com.cloud.hypervisor.xen.resource.CitrixResourceBase.executeRequest(CitrixResourceBase.java:570) > at > com.cloud.hypervisor.xen.resource.XenServer56Resource.executeRequest(XenServer56Resource.java:73) > at > com.cloud.hypervisor.xen.resource.XenServer610Resource.executeRequest(XenServer610Resource.java:102) > at > com.cloud.agent.manager.DirectAgentAttache$Task.run(DirectAgentAttache.java:186) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at > java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:679) > 2013-05-09 17:16:05,615 DEBUG [agent.manager.DirectAgentAttache] > (DirectAgent-267:null) Seq 1-1435828229: Response Received: > 2013-05-09 17:16:05,615 DEBUG [agent.transport.Request] > (DirectAgent-267:null) Seq 1-1435828229: Processing: { Ans: , MgmtId: > 7647994577963, via: 1, Ver: v1, Flags: 10, > [{"Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: > Unable to reset master of slave 10.223.81.62 to 10.223.81.61 due to > org.apache.xmlrpc.XmlRpcException: Failed to read server's response: connect > timed out","wait":0}}] } > 2013-05-09 17:16:05,704 DEB -- This message was sent by Atlassian JIRA (v6.1.4#6159)