[ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated YARN-11210:
------------------------------
         Hadoop Flags: Reviewed
     Target Version/s: 3.4.0
    Affects Version/s: 3.4.0

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-11210
>                 URL: https://issues.apache.org/jira/browse/YARN-11210
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.4.0
>            Reporter: Kevin Wikant
>            Assignee: Kevin Wikant
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications that call the YARN RMAdminCLI (i.e. the YARN ResourceManager 
> client) synchronously can be blocked for up to 15 minutes with the default 
> value of "yarn.resourcemanager.connect.max-wait.ms". This is not an issue in 
> and of itself, but a non-retryable IllegalArgumentException thrown within the 
> YARN ResourceManager client is being swallowed & treated as a retryable 
> "connection exception", meaning it gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 2 failover attempts. Trying to failover after sleeping for 25962ms.
> >> Success is silent in client logs, but can be seen in the ResourceManager 
> >> logs <<
> {quote}
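> As an aside, a caller that cannot afford to wait 15 minutes in such cases can 
> shrink the retry window through the same property referenced above. A minimal 
> sketch (not part of this JIRA; the 30-second value is an arbitrary example):
> {code:java}
> import org.apache.hadoop.yarn.conf.YarnConfiguration;
> 
> public final class RetryWindowExample {
>   public static void main(String[] args) {
>     YarnConfiguration conf = new YarnConfiguration();
>     // Default is 15 minutes; cut the total connect wait down to 30 seconds.
>     conf.setLong("yarn.resourcemanager.connect.max-wait.ms", 30_000L);
>     // Pass this conf to the YARN client/admin API being used.
>   }
> }
> {code}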
> Then there are cases where the YARN ResourceManager client stops retrying 
> because it has encountered a non-retryable exception. Some examples:
>  * the client is configured with SIMPLE auth while the ResourceManager is 
> configured with KERBEROS auth
>  ** this RemoteException is not a transient failure & will not recover without 
> the client modifying its configuration, which is why it fails immediately
>  ** the exception comes from the ResourceManager server side & occurs only 
> once the client successfully reaches the ResourceManager
>  
> {quote}> yarn rmadmin -refreshNodes
> 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> refreshNodes: org.apache.hadoop.security.AccessControlException: SIMPLE 
> authentication is not enabled.  Available:[KERBEROS]
> {quote}
>  
>  * the client & server are configured with KERBEROS auth but the client has 
> not run kinit
>  ** this SaslException is not a transient failure & will not recover without 
> the client modifying its configuration, which is why it fails immediately
>  ** the exception comes from the client side & occurs before the client even 
> attempts to call the ResourceManager
> {quote}> yarn rmadmin -refreshNodes
> 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/07/12 15:20:33 WARN ipc.Client: Exception encountered while connecting to 
> the server
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>         at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>         at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:820)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:820)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1401)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)
>         at 
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162)
>         at 
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
>         at 
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:189)
>         at 
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
>         at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>         at 
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>         at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
>         ... 34 more
> refreshNodes: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]; Host Details : local host is: "0.0.0.0/0.0.0.0"; 
> destination host is: "0.0.0.0":8033;
> {quote}
> h2. The Problem
> When the client has:
>  * kerberos enabled by setting "hadoop.security.authentication = kerberos" in 
> "core-site.xml"
>  * a bad kerberos configuration where "yarn.resourcemanager.principal" is 
> unset or malformed in "yarn-site.xml"
> A client with this bad configuration can never successfully connect to the 
> ResourceManager & should therefore fail with a non-retryable error.
> When the YARN ResourceManager client has this bad configuration, an 
> IllegalArgumentException gets thrown (in 
> org.apache.hadoop.security.SaslRpcClient) but is then swallowed by an 
> IOException (in org.apache.hadoop.ipc.Client) that gets treated as a 
> retryable failure & is therefore retried for 15 minutes:
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:23:45 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:23:46 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:47 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:48 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:23:54 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:55 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:56 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:57 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:23:58 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:24:04 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:24:05 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:24:05 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 27166ms.
> 22/06/28 14:24:32 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 2 failover attempts. Trying to failover after sleeping for 22291ms.
> 22/06/28 14:24:54 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 3 failover attempts. Trying to failover after sleeping for 24773ms.
> 22/06/28 14:25:19 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 4 failover attempts. Trying to failover after sleeping for 39187ms.
> ...
> 22/06/28 14:36:50 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 26 failover attempts. Trying to failover after sleeping for 26235ms.
> 22/06/28 14:37:16 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 27 failover attempts. Trying to failover after sleeping for 40535ms.
> 22/06/28 14:37:57 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 28 failover attempts. Trying to failover after sleeping for 26721ms.
> 22/06/28 14:38:23 INFO retry.RetryInvocationHandler: java.io.IOException: 
> Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
> host is: "0.0.0.0":8033; , while invoking 
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null 
> after 29 failover attempts. Trying to failover after sleeping for 27641ms.
> refreshNodes: Failed on local exception: java.io.IOException: Couldn't set up 
> IO streams: java.lang.IllegalArgumentException: Failed to specify server's 
> Kerberos principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; 
> destination host is: "0.0.0.0":8033;
> {quote}
> This non-retryable failure should not be treated as a retryable "connection 
> failure".
> h2. The Solution
> Surface the IllegalArgumentException to the RetryInvocationHandler & have 
> YARN RMProxy treat IllegalArgumentException as non-retryable.
> Note that surfacing IllegalArgumentException has the side effect of causing 
> the [command usage to be printed 
> here|https://github.com/apache/hadoop/blob/c0bdba8face85fbd40f5d7ba46af11e24a8ef25b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java#L790]:
> {quote}...
> refreshNodes: Failed to specify server's Kerberos principal name
> Usage: yarn rmadmin [-refreshNodes [-g|graceful [timeout in seconds] 
> -client|server]]
> Generic options supported are:
> -conf <configuration file>        specify an application configuration file
> -D <property=value>               define a value for a given property
> -fs <[file:///]|hdfs://namenode:port> specify default filesystem URL to use, 
> overrides 'fs.defaultFS' property from configurations.
> -jt <local|resourcemanager:port>  specify a ResourceManager
> -files <file1,...>                specify a comma-separated list of files to 
> be copied to the map reduce cluster
> -libjars <jar1,...>               specify a comma-separated list of jar files 
> to be included in the classpath
> -archives <archive1,...>          specify a comma-separated list of archives 
> to be unarchived on the compute machines
> The general command line syntax is:
> command [genericOptions] [commandOptions]
> {quote}
> To resolve this issue, the IllegalArgumentException is caught & surfaced as a 
> KerberosAuthException, which was chosen because it is already treated as 
> non-retryable in FailoverOnNetworkExceptionRetry.
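> A minimal sketch of that approach (illustrative only, not the exact patch):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.security.KerberosAuthException;
> 
> public final class SurfacedExceptionSketch {
>   static void createSaslClientSketch() throws IOException {
>     try {
>       // Same IllegalArgumentException as before, thrown while resolving
>       // the server's Kerberos principal
>       throw new IllegalArgumentException(
>           "Failed to specify server's Kerberos principal name");
>     } catch (IllegalArgumentException iae) {
>       // Re-surface it as an auth exception so the retry policy fails fast
>       // instead of treating it as a retryable connection failure
>       throw new KerberosAuthException(
>           "Bad Kerberos server principal configuration", iae);
>     }
>   }
> }
> {code}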
> Note that in terms of RetryPolicy:
>  * non-HA YARN ResourceManager client should use 
> OtherThanRemoteExceptionDependentRetry (but because of a bug uses 
> FailoverOnNetworkExceptionRetry)
>  * HA YARN ResourceManager client uses FailoverOnNetworkExceptionRetry
> The result of this change is a much quicker failure when the YARN client is 
> misconfigured:
>  * non-HA YARN ResourceManager client 
>  
> {quote}> yarn rmadmin -refreshNodes
> 22/07/13 17:36:03 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/07/13 17:36:03 WARN ipc.Client: Exception encountered while connecting to 
> the server
> javax.security.sasl.SaslException: Bad Kerberos server principal 
> configuration [Caused by java.lang.IllegalArgumentException: Failed to 
> specify server's Kerberos principal name]
>         at 
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
>         at 
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
>         at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: java.lang.IllegalArgumentException: Failed to specify server's 
> Kerberos principal name
>         at 
> org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
>         at 
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
>         ... 35 more
> refreshNodes: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: Bad Kerberos server principal 
> configuration [Caused by java.lang.IllegalArgumentException: Failed to 
> specify server's Kerberos principal name]; Host Details : local host is: 
> "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
>  * HA YARN ResourceManager client
>  
> {quote}> yarn rmadmin -refreshNodes
> 22/07/13 17:37:50 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/07/13 17:37:50 WARN ipc.Client: Exception encountered while connecting to 
> the server
> javax.security.sasl.SaslException: Bad Kerberos server principal 
> configuration [Caused by java.lang.IllegalArgumentException: Failed to 
> specify server's Kerberos principal name]
>         at 
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
>         at 
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
>         at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>         at 
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: java.lang.IllegalArgumentException: Failed to specify server's 
> Kerberos principal name
>         at 
> org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
>         at 
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
>         ... 35 more
> refreshNodes: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: Bad Kerberos server principal 
> configuration [Caused by java.lang.IllegalArgumentException: Failed to 
> specify server's Kerberos principal name]; Host Details : local host is: 
> "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
> h2. Other Notes
> The YARN RMProxy will return separate RetryPolicies for HA & non-HA, but the 
> YARN client will always use the HA policy because a configuration related to 
> [Federation 
> Failover|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]
>  is [enabled by 
> default|https://github.com/apache/hadoop/blob/e044a46f97dcc7998dc0737f15cf3956dca170c4/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java#L3901].
>  This is presumably a bug because YARN Federation is not enabled for the 
> cluster I am testing on.
> The fix is to modify HAUtil.isFederationFailoverEnabled to check whether 
> "yarn.federation.enabled" (default false) is enabled, in addition to checking 
> whether "yarn.federation.failover.enabled" (default true) is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
