[ https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan updated YARN-11210:
------------------------------
      Hadoop Flags: Reviewed
 Target Version/s: 3.4.0
Affects Version/s: 3.4.0

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-11210
>                 URL: https://issues.apache.org/jira/browse/YARN-11210
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.4.0
>            Reporter: Kevin Wikant
>            Assignee: Kevin Wikant
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications that call the YARN RMAdminCLI (i.e. the YARN ResourceManager client)
> synchronously can be blocked for up to 15 minutes with the default value of
> "yarn.resourcemanager.connect.max-wait.ms". That alone is not a problem; the bug is
> that a non-retryable IllegalArgumentException thrown inside the YARN
> ResourceManager client gets swallowed & treated as a retryable "connection
> exception", so it is retried for the full 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it does not
> retry on this non-retryable exception.
> h2. Background Information
> The YARN ResourceManager client treats connection exceptions as retryable &, with
> the default value of "yarn.resourcemanager.connect.max-wait.ms", will attempt to
> connect to the ResourceManager for up to 15 minutes when facing "connection
> exceptions". This arguably makes sense because connection exceptions are in some
> cases transient & can be recovered from without any action needed from the client.
> See the example below, where the YARN ResourceManager client was able to recover
> from connection issues that resulted from the ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes > 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 1 failover attempts. 
Trying to failover after sleeping for 41061ms. > 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 2 failover attempts. Trying to failover after sleeping for 25962ms. > >> Success is silent in client logs, but can be seen in the ResourceManager > >> logs << > {quote} > Then there are cases where the YARN ResourceManager client will stop retrying > because it has encountered a non retryable exception. Some examples: > * client is configured with SIMPLE auth when ResourceManager is configured > with KERBEROS auth > ** this RemoteException is not a transient failure & will not recover > without the client taking action to modify their configuration, this is why > it fails immediately > ** the exception is coming from ResourceManager server-side & will occur > once the client successfully calls the ResourceManager > > {quote}> yarn rmadmin -refreshNodes > 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > refreshNodes: org.apache.hadoop.security.AccessControlException: SIMPLE > authentication is not enabled. 
Available:[KERBEROS]
> {quote}
>
> * client & server are configured with KERBEROS auth but the client has not run kinit
> ** this SaslException is not a transient failure & will not recover without the
> client taking action (e.g. running kinit), which is why it fails immediately
> ** the exception comes from the client side & occurs before the client even
> attempts to call the ResourceManager
> {quote}> yarn rmadmin -refreshNodes
> 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033
> 22/07/12 15:20:33 WARN ipc.Client: Exception encountered while connecting to the server
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> 	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> 	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> 	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
> 	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
> 	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:820)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:820)
> 	at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1448)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>
at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt) > at > sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122) > at > 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:189)
> 	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
> 	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
> 	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
> 	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
> 	... 34 more
> refreshNodes: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
> h2. The Problem
> When the client has:
> * kerberos enabled by setting "hadoop.security.authentication = kerberos" in "core-site.xml"
> * a bad kerberos configuration where "yarn.resourcemanager.principal" is unset or malformed in "yarn-site.xml"
> then the client can never successfully connect to the ResourceManager & should
> therefore fail with a non-retryable error.
> With this bad configuration an IllegalArgumentException gets thrown (in
> org.apache.hadoop.security.SaslRpcClient) but is then swallowed by an IOException
> (in org.apache.hadoop.ipc.Client) that gets treated as a retryable failure &
> therefore is retried for 15 minutes:
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:23:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033
> 22/06/28 14:23:46 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 22/06/28 14:23:47 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8033.
Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:23:48 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:23:54 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:23:55 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:23:56 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:23:57 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:23:58 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > ... > 22/06/28 14:24:04 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:24:05 INFO ipc.Client: Retrying connect to server: > 0.0.0.0/0.0.0.0:8033. 
Already tried 9 time(s); retry policy is > RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 > MILLISECONDS) > 22/06/28 14:24:05 INFO retry.RetryInvocationHandler: > java.net.ConnectException: Your endpoint configuration is wrong; For more > details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while > invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over > null after 1 failover attempts. Trying to failover after sleeping for 27166ms. > 22/06/28 14:24:32 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 2 failover attempts. Trying to failover after sleeping for 22291ms. > 22/06/28 14:24:54 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 3 failover attempts. Trying to failover after sleeping for 24773ms. > 22/06/28 14:25:19 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 4 failover attempts. Trying to failover after sleeping for 39187ms. 
> ... > 22/06/28 14:36:50 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 26 failover attempts. Trying to failover after sleeping for 26235ms. > 22/06/28 14:37:16 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 27 failover attempts. Trying to failover after sleeping for 40535ms. > 22/06/28 14:37:57 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 28 failover attempts. Trying to failover after sleeping for 26721ms. > 22/06/28 14:38:23 INFO retry.RetryInvocationHandler: java.io.IOException: > Failed on local exception: java.io.IOException: Couldn't set up IO streams: > java.lang.IllegalArgumentException: Failed to specify server's Kerberos > principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination > host is: "0.0.0.0":8033; , while invoking > ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null > after 29 failover attempts. 
Trying to failover after sleeping for 27641ms.
> refreshNodes: Failed on local exception: java.io.IOException: Couldn't set up IO streams: java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
> This non-retryable failure should not be treated as a retryable "connection failure".
> h2. The Solution
> Surface the IllegalArgumentException to the RetryInvocationHandler & have YARN
> RMProxy treat IllegalArgumentException as non-retryable.
> Note that surfacing the IllegalArgumentException has the side-effect of causing the
> [command usage to be printed here|https://github.com/apache/hadoop/blob/c0bdba8face85fbd40f5d7ba46af11e24a8ef25b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java#L790]
> {quote}...
> refreshNodes: Failed to specify server's Kerberos principal name
> Usage: yarn rmadmin [-refreshNodes [-g|graceful [timeout in seconds] -client|server]]
> Generic options supported are:
> -conf <configuration file>               specify an application configuration file
> -D <property=value>                      define a value for a given property
> -fs <[file:///]|hdfs://namenode:port>    specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
> -jt <local|resourcemanager:port>         specify a ResourceManager
> -files <file1,...>                       specify a comma-separated list of files to be copied to the map reduce cluster
> -libjars <jar1,...>                      specify a comma-separated list of jar files to be included in the classpath
> -archives <archive1,...>                 specify a comma-separated list of archives to be unarchived on the compute machines
> The general command line syntax is:
> command [genericOptions] [commandOptions]
> {quote}
> To resolve this issue the IllegalArgumentException is swallowed & surfaced as a
> KerberosAuthException; this was chosen because KerberosAuthException is already
> treated as non-retryable in FailoverOnNetworkExceptionRetry.
> Note that in terms of RetryPolicy:
> * the non-HA YARN ResourceManager client should use OtherThanRemoteExceptionDependentRetry (but because of a bug uses FailoverOnNetworkExceptionRetry)
> * the HA YARN ResourceManager client uses FailoverOnNetworkExceptionRetry
> The result of this change is a much quicker failure when the YARN client is misconfigured:
> * non-HA YARN ResourceManager client
>
> {quote}> yarn rmadmin -refreshNodes
> 22/07/13 17:36:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033
> 22/07/13 17:36:03 WARN ipc.Client: Exception encountered while connecting to the server
> javax.security.sasl.SaslException: Bad Kerberos server principal configuration [Caused by java.lang.IllegalArgumentException: Failed to specify server's Kerberos principal name]
> 	at org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
> 	at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
> 	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
> 	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
> 	at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
> 	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
> 	at
org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821) > at > org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349) > at > 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027) > Caused by: java.lang.IllegalArgumentException: Failed to specify server's > Kerberos principal name > at > org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332) > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233) > ... 35 more > refreshNodes: Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: Bad Kerberos server principal > configuration [Caused by java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name]; Host Details : local host is: > "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033; > {quote} > * HA YARN ResourceManager client > > > {quote}> yarn rmadmin -refreshNodes > 22/07/13 17:37:50 INFO client.RMProxy: Connecting to ResourceManager at > /0.0.0.0:8033 > 22/07/13 17:37:50 WARN ipc.Client: Exception encountered while connecting to > the server > javax.security.sasl.SaslException: Bad Kerberos server principal > configuration [Caused by java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name] > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237) > at > org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630) > at > org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424) > at 
org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821) > at > org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612) > at org.apache.hadoop.ipc.Client.call(Client.java:1442) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) > at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source) > at > 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) > at > org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027) > Caused by: java.lang.IllegalArgumentException: Failed to specify server's > Kerberos principal name > at > org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332) > at > org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233) > ... 35 more > refreshNodes: Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: Bad Kerberos server principal > configuration [Caused by java.lang.IllegalArgumentException: Failed to > specify server's Kerberos principal name]; Host Details : local host is: > "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033; > {quote} > h2. Other Notes > The YARN RMProxy will return separate RetryPolicies for HA & non-HA, but the > YARN client will always use the HA policy because a configuration related to > [Federation > Failover|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102] > is [enabled by > default|https://github.com/apache/hadoop/blob/e044a46f97dcc7998dc0737f15cf3956dca170c4/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java#L3901]. > This is presumably a bug because YARN Federation is not enabled for the > cluster I am testing on. 
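> The "always uses the HA policy" behaviour described above can be sketched in plain Java. This is an illustrative model only, not the actual RMProxy code: the class and helper names are hypothetical, and only the configuration keys and defaults come from the discussion above.

```java
import java.util.Map;

// Illustrative model (not the real RMProxy): shows how a federation
// failover flag that defaults to true short-circuits the HA-vs-non-HA
// retry policy choice even on a non-federated, non-HA cluster.
public class RetryPolicyChoice {
    // Read a boolean config value from a plain Map, with a default,
    // standing in for Hadoop's Configuration.getBoolean.
    public static boolean getBool(Map<String, String> conf, String key, boolean dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Boolean.parseBoolean(v);
    }

    // Mirrors the buggy behaviour: "yarn.federation.failover.enabled"
    // defaults to true, so the HA policy wins with an empty config.
    public static String choosePolicy(Map<String, String> conf) {
        boolean haEnabled = getBool(conf, "yarn.resourcemanager.ha.enabled", false);
        boolean federationFailover = getBool(conf, "yarn.federation.failover.enabled", true);
        if (haEnabled || federationFailover) {
            return "FailoverOnNetworkExceptionRetry";
        }
        return "OtherThanRemoteExceptionDependentRetry";
    }

    public static void main(String[] args) {
        // Empty config: HA policy is chosen anyway because of the default.
        System.out.println(choosePolicy(Map.of())); // FailoverOnNetworkExceptionRetry
    }
}
```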
> The fix is to modify HAUtil.isFederationFailoverEnabled to check whether
> "yarn.federation.enabled" (default false) is set, in addition to checking
> "yarn.federation.failover.enabled" (default true).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
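A minimal sketch of the proposed check. This is illustrative only: the real HAUtil.isFederationFailoverEnabled operates on a Hadoop Configuration object, modelled here with a plain Map; the defaults (yarn.federation.enabled=false, yarn.federation.failover.enabled=true) follow the description above.

```java
import java.util.Map;

// Sketch of the proposed fix: federation failover only counts as
// enabled when federation itself is enabled, so a default config no
// longer forces the HA retry path.
public class FederationFailoverCheck {
    // Stand-in for Hadoop's Configuration.getBoolean.
    public static boolean getBool(Map<String, String> conf, String key, boolean dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Boolean.parseBoolean(v);
    }

    public static boolean isFederationFailoverEnabled(Map<String, String> conf) {
        return getBool(conf, "yarn.federation.enabled", false)
            && getBool(conf, "yarn.federation.failover.enabled", true);
    }

    public static void main(String[] args) {
        // With an entirely default config, federation failover is now off.
        System.out.println(isFederationFailoverEnabled(Map.of())); // false
        // Enabling federation restores the failover behaviour.
        System.out.println(isFederationFailoverEnabled(
            Map.of("yarn.federation.enabled", "true"))); // true
    }
}
```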