[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246125#comment-14246125 ]
Steve Loughran commented on YARN-2710: -------------------------------------- Immediate failure trigger is that the retry handler says "Not retrying because failovers (5) exceeded maximum allowed (5)" If you look at the lines immediately above it, the RM is still bootstrapping hence not listening for connections. The test is probably just starting too early. Recommend changing the failure/retry policy to to add some backoff & maybe more retries {code} 014-12-14 11:40:06,693 INFO [Thread-442] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(165)) - USER=jenkins OPERATION=refreshSuperUserGroupsConfiguration TARGET=AdminService RESULT=SUCCESS 2014-12-14 11:40:06,693 INFO [Thread-442] conf.Configuration (Configuration.java:getConfResourceAsInputStream(2240)) - found resource core-site.xml at file:/home/jenkins/jenkins-slave/workspace/Hadoop-Yarn-trunk-Java8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/target/test-classes/core-site.xml 2014-12-14 11:40:06,694 INFO [Thread-442] security.Groups (Groups.java:refresh(248)) - clearing userToGroupsMap cache 2014-12-14 11:40:06,694 INFO [Thread-442] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(165)) - USER=jenkins OPERATION=refreshUserToGroupsMappings TARGET=AdminService RESULT=SUCCESS 2014-12-14 11:40:06,694 INFO [Thread-442] resourcemanager.RMAuditLogger (RMAuditLogger.java:logSuccess(165)) - USER=jenkins OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=SUCCESS 2014-12-14 11:40:07,539 INFO [Thread-443] client.ConfiguredRMFailoverProxyProvider (ConfiguredRMFailoverProxyProvider.java:performFailover(100)) - Failing over to rm2 2014-12-14 11:40:07,541 WARN [Thread-443] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(117)) - Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getContainers over rm2. Not retrying because failovers (5) exceeded maximum allowed (5) java.net.ConnectException: Call From asf908.gq1.ygridcore.net/67.195.81.152 to asf908.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) at org.apache.hadoop.ipc.Client.call(Client.java:1472) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy16.getContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getContainers(ApplicationClientProtocolPBClientImpl.java:410) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy17.getContainers(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getContainers(YarnClientImpl.java:654) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetContainersOnHA(TestApplicationClientProtocolOnHA.java:155) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:712) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) ... 23 more {code} > RM HA tests failed intermittently on trunk > ------------------------------------------ > > Key: YARN-2710 > URL: https://issues.apache.org/jira/browse/YARN-2710 > Project: Hadoop YARN > Issue Type: Bug > Components: client > Reporter: Wangda Tan > Attachments: TestResourceTrackerOnHA-output.2.txt, > org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt > > > Failure like, it can be happened in TestApplicationClientProtocolOnHA, > TestResourceTrackerOnHA, etc. > {code} > org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA > testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) > Time elapsed: 9.491 sec <<< ERROR! > java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 > to asf905.gq1.ygridcore.net:28032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) > at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) > at org.apache.hadoop.ipc.Client.call(Client.java:1438) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) > at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) > at > org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)