[ https://issues.apache.org/jira/browse/YARN-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928312#comment-16928312 ]
Bibin A Chundatt commented on YARN-9823:
----------------------------------------

[~lichaojacobs] YARN-8434 should help you.

> NodeManager cannot get right ResourceTrack address in Federation mode
> ---------------------------------------------------------------------
>
>                 Key: YARN-9823
>                 URL: https://issues.apache.org/jira/browse/YARN-9823
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: federation, nodemanager
>    Affects Versions: 2.9.2
>         Environment: h2. Hadoop:
> Hadoop 2.9.2 (some line numbers may not be right because we have merged some
> 3.0+ patches)
> Security with Kerberos
> Configured from
> [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html]
> h2. Java:
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
> Kerberos:
>
>
>            Reporter: qiwei huang
>            Priority: Major
>
> {{The NM will try indefinitely to connect to the wrong RM's resource tracker port:}}
> {quote}{{INFO [main:RetryInvocationHandler@411] - java.net.ConnectException:
> Call From standby.rm.server/10.122.138.139 to standby.rm.server:8031
> failed on connection exception: java.net.ConnectException: Connection
> refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
> ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 failover
> attempts. Trying to failover after sleeping for 40497ms.}}
> {quote}
>
> {{After changing *yarn.client.failover-proxy-provider* to
> *org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider*,
> the NodeManager cannot find the right ResourceTracker address:}}
> {quote}{{getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)}}
> {{getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)}}
> {{run:144, FederationRMFailoverProxyProvider$1
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{doPrivileged:-1, AccessController (java.security)}}
> {{doAs:422, Subject (javax.security.auth)}}
> {{doAs:1893, UserGroupInformation (org.apache.hadoop.security)}}
> {{getProxyInternal:141, FederationRMFailoverProxyProvider
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{performFailover:192, FederationRMFailoverProxyProvider
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{failover:217, RetryInvocationHandler$ProxyDescriptor
> (org.apache.hadoop.io.retry)}}
> {{processRetryInfo:149, RetryInvocationHandler$Call
> (org.apache.hadoop.io.retry)}}
> {{processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call
> (org.apache.hadoop.io.retry)}}
> {{invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)}}
> {{registerNodeManager:-1, $Proxy85 (com.sun.proxy)}}
> {{registerWithRM:378, NodeStatusUpdaterImpl
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{serviceStart:252, NodeStatusUpdaterImpl
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{serviceStart:121, CompositeService (org.apache.hadoop.service)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{initAndStartNodeManager:864, NodeManager
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)}}
> {quote}
> {{The provider tries to find the RM address in}} *{{getRMHAId:233}}*{{, but
> it cannot find the right one because that method can only return the RM
> whose address is local:}}
> {quote}{{if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {}}
> {{ currentRMId = rmId.trim();}}
> {{ found++;}}
> {{}}}
> {quote}
> {{If the NM and an RM are on the same node and that RM is in standby state,
> the NM will keep sending RPC calls to that RM indefinitely.}}
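
For readers unfamiliar with the check quoted above, here is a minimal, self-contained sketch of the same selection logic in plain Java. It is not the Hadoop source: the class `LocalRmIdSketch`, the helpers `isLocal` and `pickLocalRmId`, and the host name `active.rm.example` are hypothetical, and `isLocal` only approximates what `NetUtils.isLocalAddress` is assumed to do. It shows why the lookup can only ever select the RM id whose address belongs to the local machine, whether or not that RM is active.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalRmIdSketch {

  // Hypothetical stand-in for NetUtils.isLocalAddress: true when the
  // address is a wildcard, loopback, or bound to a local interface.
  static boolean isLocal(InetAddress addr) {
    if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
      return true;
    }
    try {
      return NetworkInterface.getByInetAddress(addr) != null;
    } catch (SocketException e) {
      return false;
    }
  }

  // Mirrors the loop quoted in the report: keep the RM id whose address is
  // resolved and local. Returning null when zero or several ids match is a
  // simplification made for this sketch.
  static String pickLocalRmId(Map<String, InetSocketAddress> rmAddresses) {
    String currentRmId = null;
    int found = 0;
    for (Map.Entry<String, InetSocketAddress> e : rmAddresses.entrySet()) {
      InetSocketAddress s = e.getValue();
      if (!s.isUnresolved() && isLocal(s.getAddress())) {
        currentRmId = e.getKey().trim();
        found++;
      }
    }
    return found == 1 ? currentRmId : null;
  }

  public static void main(String[] args) {
    Map<String, InetSocketAddress> rms = new LinkedHashMap<>();
    // rm1 plays the standby RM co-located with the NM (loopback here).
    rms.put("rm1", new InetSocketAddress(InetAddress.getLoopbackAddress(), 8031));
    // rm2 plays the active RM on another host; left unresolved on purpose.
    rms.put("rm2", InetSocketAddress.createUnresolved("active.rm.example", 8031));
    // The local id is selected even though, in the scenario, rm2 is active.
    System.out.println(pickLocalRmId(rms));
  }
}
```

In this sketch the co-located `rm1` is always chosen, which matches the reported symptom: the NM keeps targeting the local standby RM's port 8031 after every failover attempt.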