Samrat002 opened a new pull request, #8208:
URL: https://github.com/apache/hadoop/pull/8208

   ### Description of PR
   When a Hadoop cluster runs in the cloud on spot instances, an AM may be
   launched on one of those instances. When such instances are removed, we
   observed many AM launch failures caused by token expiry or container
   liveliness expiry, because the AM launch threads were busy retrying
   connections to AM hosts (spot instances) that were already down. Having
   separate thread pools for cleanup and launch will reduce the AM launch
   failures.
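
The idea of isolating launch work from cleanup work can be sketched as follows. This is a minimal illustration only, not the code in this patch: the class name `SeparateAmPools`, its methods, and the pool sizes are hypothetical, and it uses plain `java.util.concurrent` executors rather than any YARN class.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: two independent fixed-size pools, so that cleanup
// tasks stuck retrying a dead host cannot occupy the launch threads.
final class SeparateAmPools {
    private final ExecutorService launchPool;
    private final ExecutorService cleanupPool;

    SeparateAmPools(int launchThreads, int cleanupThreads) {
        this.launchPool = Executors.newFixedThreadPool(launchThreads);
        this.cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
    }

    // Launch tasks only ever run on launch threads.
    Future<?> submitLaunch(Runnable task) {
        return launchPool.submit(task);
    }

    // Cleanup tasks only ever run on cleanup threads.
    Future<?> submitCleanup(Runnable task) {
        return cleanupPool.submit(task);
    }

    // Interrupts running tasks and stops both pools.
    void shutdown() {
        launchPool.shutdownNow();
        cleanupPool.shutdownNow();
    }
}
```

With a single shared pool, a cleanup task that blocks on an unreachable host holds a thread that a launch could have used. With two pools, a blocked cleanup task delays only other cleanup tasks.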
   
   ### Token Expired
   
   ```
   2022-07-19 14:56:33,486 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
(IPC Server handler 39 on 8041): Unauthorized request to start container.
   This token is expired. current time is 1658242593486 found 1658242289457
   Note: System times on machines may be out of sync. Check system time and 
time zones.
   ```
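
To make the size of the delay concrete, the two timestamps in the log above can be subtracted. The sketch below assumes that the "found" value is the token's expiry time in epoch milliseconds, which is how the message reads. The class name `TokenExpiryGap` is hypothetical.

```java
import java.time.Duration;

// Hypothetical helper: how long after expiry a start-container request arrived.
final class TokenExpiryGap {
    static Duration gap(long currentTimeMs, long expiryTimeMs) {
        return Duration.ofMillis(currentTimeMs - expiryTimeMs);
    }

    public static void main(String[] args) {
        // Values copied from the log line above.
        Duration late = gap(1658242593486L, 1658242289457L);
        System.out.println(late.toMillis() + " ms"); // prints "304029 ms"
    }
}
```

Under that assumption, the request reached the NodeManager about 304 seconds (roughly five minutes) after the token had expired.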
   
   ### Container Liveliness Expiry
   
   ```
   2022-07-19 16:06:48,663 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
(ResourceManager Event Processor): container_xxxxxxxxxxxxx_xxxxxxx_xx_000001 
Container Transitioned from ACQUIRED to EXPIRED
   
   2022-07-19 16:10:08,663 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker): 
Expired:<container=container_xxxxxxxxxxxxx_xxxxxxx_xx_000001, increase=false> 
Timed out after 600 secs
   ```
   
   Associated ticket: [YARN-11251](https://issues.apache.org/jira/browse/YARN-11251)
   
   
   ### How was this patch tested?
   This patch was tested on an EMR cluster with 1 master node, 1 core node,
   and 2 task nodes, where the task nodes are spot instances. We launched an
   AM on one of the task nodes and then brought that node down, which
   reproduces the scenario described above.
   
   TODO: unit tests need to be added.

   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [x] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]