KevinWikant opened a new pull request, #4563:
URL: https://github.com/apache/hadoop/pull/4563

   ### Description of PR
   
   Applications that call the YARN RMAdminCLI (i.e. the YARN ResourceManager 
client) synchronously can be blocked for up to 15 minutes with the default 
value of "yarn.resourcemanager.connect.max-wait.ms". This is not an issue in 
and of itself, but a non-retryable IllegalArgumentException thrown within the 
YARN ResourceManager client is swallowed and treated as a retryable 
"connection exception", so the call is retried for the full 15 minutes.
   
   The purpose of this JIRA (and PR) is to modify the YARN client so that it 
does not retry on this non-retryable exception.
   
   See JIRA for additional details: 
https://issues.apache.org/jira/browse/YARN-11210
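
   As a rough illustration of the idea (not the actual change in this PR), a 
retry decision can walk the exception's cause chain and refuse to retry when it 
finds a configuration error. The class and method names below are hypothetical 
and are not Hadoop APIs; treating SaslException as non-retryable is an 
assumption based on the post-fix log output in the testing section.

```java
import java.io.IOException;
import java.net.ConnectException;
import javax.security.sasl.SaslException;

/** Hypothetical helper, not part of Hadoop: classifies failures by cause chain. */
final class RetryClassifierSketch {

  /** True when any exception in the cause chain points at a configuration error. */
  static boolean hasNonRetryableCause(Throwable t) {
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
      if (t instanceof IllegalArgumentException || t instanceof SaslException) {
        return true;
      }
    }
    return false;
  }

  /** Retry only I/O failures that do not wrap a configuration error. */
  static boolean shouldRetry(Throwable t) {
    return t instanceof IOException && !hasNonRetryableCause(t);
  }

  public static void main(String[] args) {
    // Mirrors the shape of the failure in the logs: an IOException wrapping
    // an IllegalArgumentException about the missing Kerberos principal.
    Throwable misconfigured = new IOException(
        "Couldn't set up IO streams",
        new IllegalArgumentException(
            "Failed to specify server's Kerberos principal name"));
    Throwable transientFailure = new ConnectException("Connection refused");

    System.out.println(shouldRetry(misconfigured));    // prints "false"
    System.out.println(shouldRetry(transientFailure)); // prints "true"
  }
}
```

   The point of the sketch is only that the wrapped IllegalArgumentException is 
inspected before the outer IOException is classified as a connection failure.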
   
   ### How was this patch tested?
   
   - Create Kerberized YARN cluster
   
   - Run YARN rmadmin client & validate it completes successfully
   
   ```
   > yarn rmadmin -refreshNodes;
   
   22/07/13 15:30:45 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8033
   
   >> Success is silent in client logs, but can be seen in the ResourceManager 
logs <<
   ```
   
   - Unset the value of "yarn.resourcemanager.principal" in "yarn-site.xml"
   
   - Run YARN rmadmin client & validate it retries for 15 minutes
   
   ```
   > yarn rmadmin -refreshNodes;
   
   22/06/28 14:23:45 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8033
   22/06/28 14:23:46 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
   ...
   22/06/28 14:23:55 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
   22/06/28 14:23:56 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
   ...
   22/06/28 14:24:05 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
   22/06/28 14:24:05 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Your endpoint configuration is wrong; For more 
details see:  http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
1 failover attempts. Trying to failover after sleeping for 27166ms.
   22/06/28 14:24:32 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
2 failover attempts. Trying to failover after sleeping for 22291ms.
   ...
   22/06/28 14:37:57 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
28 failover attempts. Trying to failover after sleeping for 26721ms.
   22/06/28 14:38:23 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination 
host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
29 failover attempts. Trying to failover after sleeping for 27641ms.
   refreshNodes: Failed on local exception: java.io.IOException: Couldn't set 
up IO streams: java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; 
destination host is: "0.0.0.0":8033;
   ```
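
   The roughly 15 minutes and 29 failover attempts seen above are consistent 
with a simple retry budget: the maximum wait divided by the retry interval. 
The sketch below is only arithmetic. The 15 minute maximum wait comes from this 
description; the 30 second retry interval is an assumption (believed to be the 
default of "yarn.resourcemanager.connect.retry-interval.ms", which this PR does 
not mention), and the class name is hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper, not part of Hadoop: estimates the failover budget. */
final class RetryBudgetSketch {

  /** Number of failover attempts that fit into the maximum wait. */
  static long maxFailoverAttempts(long maxWaitMs, long retryIntervalMs) {
    return maxWaitMs / retryIntervalMs;
  }

  public static void main(String[] args) {
    // 15 minutes is the default maximum wait stated in the description.
    long maxWaitMs = Duration.ofMinutes(15).toMillis();
    // Assumed default retry interval; not stated in this PR.
    long retryIntervalMs = Duration.ofSeconds(30).toMillis();

    System.out.println(maxFailoverAttempts(maxWaitMs, retryIntervalMs)); // prints "30"
  }
}
```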
   
   - Modify YARN client runtime classpath to contain the changes in this PR
   
   - Run YARN rmadmin client & validate it fails after 1 try (tested in both 
federation enabled & non-federation enabled clusters)
   
   ```
   > yarn rmadmin -refreshNodes;
   
   22/07/13 17:37:50 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8033
   22/07/13 17:37:50 WARN ipc.Client: Exception encountered while connecting to 
the server
   javax.security.sasl.SaslException: Bad Kerberos server principal 
configuration [Caused by java.lang.IllegalArgumentException: Failed to specify 
server's Kerberos principal name]
           at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
           at 
org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
           at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
           at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
           at 
org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
           at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
           at java.security.AccessController.doPrivileged(Native Method)
           at javax.security.auth.Subject.doAs(Subject.java:422)
           at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
           at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
           at 
org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
           at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
           at org.apache.hadoop.ipc.Client.call(Client.java:1442)
           at org.apache.hadoop.ipc.Client.call(Client.java:1395)
           at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
           at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
           at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
           at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
           at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
           at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
           at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
           at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
           at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
           at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
           at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
           at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
           at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
           at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
   Caused by: java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name
           at 
org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
           at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
           ... 35 more
   refreshNodes: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: Bad Kerberos server principal configuration 
[Caused by java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name]; Host Details : local host is: "0.0.0.0/0.0.0.0"; 
destination host is: "0.0.0.0":8033;
   ```
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

