[ https://issues.apache.org/jira/browse/YARN-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509015#comment-16509015 ]
Takanobu Asanuma commented on YARN-8416:
----------------------------------------

Moved this jira to YARN project.

> YARN in HA not failing over to a new resource manager.
> ------------------------------------------------------
>
>                 Key: YARN-8416
>                 URL: https://issues.apache.org/jira/browse/YARN-8416
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
>            Priority: Major
>
> We are running YARN in HA mode (rm1 and rm2). We hit an issue when recreating
> one of the RMs.
> # Recreated a standby RM (rm2), which gave it a new IP
> # Stopped the active RM (rm1)
> # NMs tried to fail over to rm2, but were timing out because of the old IP.
> # NMs reached the configured 30 failover retries and shut down.
> We get the following logs.
> {noformat}
> 18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
> 18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail over attempts. Trying to fail over after sleeping for 37191ms.
> org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to yarnrm2:8031 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>     at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>     ... 12 more
> {noformat}
> We get this and fail over back to rm1 30 times until:
> {noformat}
> 18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat over rm1. Not retrying because failovers (30) exceeded maximum allowed (30)
> {noformat}
> From the logs it appears that the timeouts happen because the NM is still
> trying to connect to the old IP (x.x.x.x in the logs). Looking at the code of
> the Client class, following the updateAddress method call we should expect a
> retry with the new server IP ("Retrying connect to server ..." log) up to
> ipc.client.connect.max.retries.on.timeouts times. However, we never see the
> retry logs and the call just fails with an exception. The above setting is
> left at its default of 45 on all of our NMs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
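
For readers following the report, here is a minimal Java sketch of the behaviour the reporter expected after the "Address change detected" warning: once a fresh lookup of the same host yields a different IP, the next connect attempt should target the new address rather than the cached one. This is not Hadoop's ipc.Client code. The class StaleAddressSketch, its methods, and the 10.0.0.x addresses (standing in for x.x.x.x and y.y.y.y) are all hypothetical and exist only for illustration.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

/** Hypothetical illustration only; not taken from Hadoop's ipc.Client. */
final class StaleAddressSketch {

    /** Builds an already-resolved socket address without doing a DNS lookup. */
    static InetSocketAddress resolved(String host, int[] ip, int port) {
        byte[] raw = new byte[ip.length];
        for (int i = 0; i < ip.length; i++) {
            raw[i] = (byte) ip[i];
        }
        try {
            return new InetSocketAddress(InetAddress.getByAddress(host, raw), port);
        } catch (UnknownHostException e) {
            // getByAddress only rejects byte arrays of an illegal length.
            throw new IllegalArgumentException("IP must have 4 or 16 octets", e);
        }
    }

    /** True when the freshly resolved address differs from the cached one. */
    static boolean addressChanged(InetSocketAddress cached, InetSocketAddress fresh) {
        return !cached.equals(fresh);
    }

    /** Address the next connect attempt should use: the fresh one if it changed. */
    static InetSocketAddress nextTarget(InetSocketAddress cached, InetSocketAddress fresh) {
        return addressChanged(cached, fresh) ? fresh : cached;
    }

    public static void main(String[] args) {
        // Made-up IPs: "old" is what the NM cached, "now" is what DNS returns today.
        InetSocketAddress old = resolved("yarnrm2", new int[] {10, 0, 0, 1}, 8031);
        InetSocketAddress now = resolved("yarnrm2", new int[] {10, 0, 0, 2}, 8031);
        System.out.println("changed: " + addressChanged(old, now));
        System.out.println("next target: " + nextTarget(old, now).getAddress().getHostAddress());
    }
}
```

In a real client the fresh address would come from constructing a new InetSocketAddress from the host name and port, which performs the lookup at construction time. The sketch uses fixed bytes so it runs without DNS. The stack trace in the report still shows remote=yarnrm2/x.x.x.x:8031 on the pending channel, which suggests the failing attempt was made against the cached address.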