[ 
https://issues.apache.org/jira/browse/YARN-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma moved HDFS-13669 to YARN-8416:
-----------------------------------------------

    Affects Version/s:     (was: 2.7.1)
                       2.7.1
                  Key: YARN-8416  (was: HDFS-13669)
              Project: Hadoop YARN  (was: Hadoop HDFS)

> YARN in HA not failing over to a new resource manager.
> ------------------------------------------------------
>
>                 Key: YARN-8416
>                 URL: https://issues.apache.org/jira/browse/YARN-8416
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
>            Priority: Major
>
> We are running YARN in HA mode (rm1 and rm2). We hit an issue when recreating 
> one of the RMs:
>  # Recreated a standby RM (rm2), which gave it a new IP.
>  # Stopped the active RM (rm1).
>  # NMs tried to fail over to rm2, but were timing out because of the old IP.
>  # NMs reached the configured 30 failover retries and shut down.
> We get the following logs:
> {noformat}
> 18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: 
> yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
> 18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking 
> nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail 
> over attempts. Trying to fail over after sleeping for 37191ms.
> org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a 
> to yarnrm2:8031 failed on socket timeout exception: 
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while 
> waiting for channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending 
> remote=yarnrm2/x.x.x.x:8031]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis 
> timeout while waiting for channel to be ready for connect. ch : 
> java.nio.channels.SocketChannel[connection-pending 
> remote=yarnrm2/x.x.x.x:8031]
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>         ... 12 more{noformat}
> We get this error and fail over back to rm1, repeating 30 times, until:
> {noformat}
> 18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking 
> class 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat
>  over rm1. Not retrying because failovers (30) exceeded maximum allowed 
> (30){noformat}
> From the logs, it appears that the timeouts happen because the NM is still 
> trying to connect to the old IP (x.x.x.x in the logs). Looking at the code of 
> the Client class, after the updateAddress method call we would expect a 
> retry with the new server IP (the "Retrying connect to server ..." log) up to 
> ipc.client.connect.max.retries.on.timeouts times. However, we never see the 
> retry logs, and the call just fails with the exception. That setting is left 
> at its default of 45 on all of our NMs.
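
For illustration only, the behaviour the reporter expected (re-resolve the hostname, then connect to the new address rather than the cached one) might look roughly like the sketch below. This is not the Hadoop ipc.Client code. The class AddressRefreshSketch and the methods refreshIfChanged and withAddress are hypothetical names made up for this example; only the java.net calls are real. It relies on the fact that a java.net.InetSocketAddress resolves its hostname once, at construction, so a long-lived instance keeps pointing at the old IP until a new one is built.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical sketch, not Hadoop code: refresh a cached server address
// when the hostname now resolves to a different IP.
final class AddressRefreshSketch {

    // Re-resolve the cached hostname. Return a fresh address if the IP
    // changed, otherwise return the cached instance unchanged.
    static InetSocketAddress refreshIfChanged(InetSocketAddress cached) {
        // Constructing a new InetSocketAddress triggers a new lookup.
        InetSocketAddress fresh =
            new InetSocketAddress(cached.getHostString(), cached.getPort());
        if (fresh.isUnresolved()) {
            return cached; // lookup failed, keep what we have
        }
        if (!fresh.getAddress().equals(cached.getAddress())) {
            return fresh;  // address change detected, use the new IP
        }
        return cached;
    }

    // Helper for the demo: build an address with a fixed hostname/IP pair
    // without doing a lookup, to simulate a stale cached entry.
    static InetSocketAddress withAddress(String host, byte[] ip, int port) {
        try {
            return new InetSocketAddress(InetAddress.getByAddress(host, ip), port);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad IP length", e);
        }
    }

    public static void main(String[] args) {
        // Pretend "localhost" was cached with an outdated IP (10.0.0.1).
        InetSocketAddress stale =
            withAddress("localhost", new byte[] {10, 0, 0, 1}, 8031);
        InetSocketAddress current = refreshIfChanged(stale);
        System.out.println("Old: " + stale + " New: " + current);
        // A retry loop would now connect to 'current', not 'stale'.
    }
}
```

In the logs above, the "Address change detected" warning shows the new IP was noticed, yet the ConnectTimeoutException still reports the old remote address, which is the step this sketch would handle by connecting to the refreshed address on the next attempt.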


