Tao Yang created YARN-9423:
------------------------------

             Summary: Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time
                 Key: YARN-9423
                 URL: https://issues.apache.org/jira/browse/YARN-9423
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.2.0
            Reporter: Tao Yang
            Assignee: Tao Yang

We have seen slow application recovery when many NMs are lost at the same time:
# Many NMs shut down abnormally at the same time.
# The NMs expired, and a large number of AMs started to fail over.
# AM containers were allocated but not launched for about half an hour.

During this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on the lost nodes and kept retrying to communicate with the NMs for 3 minutes (the retry policy is configured in NMProxy#createNMProxy), even though the RM already knew these NMs were lost and probably could not be reached for a long time. Meanwhile, many AM cleanup and launch operations were still waiting in the queue (ApplicationMasterLauncher#masterEvents). AM launch operations were therefore blocked behind cleanup operations that spent 3 minutes retrying. As a result, AM failover can be very slow.

I think we can optimize the AM launcher in two ways:
# Change the type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are executed ahead of cleanup operations.
# In the cleanup process (AMLauncher#cleanup), check the node state first and skip cleaning up AM containers on non-existent or unusable NMs before trying to communicate with them, because these NMs probably cannot be reached for a long time.
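
The two proposed changes can be illustrated with a minimal, self-contained sketch. This is not the actual ResourceManager code: `Event`, `EventType`, `drain` and the `usableNodes` set are hypothetical stand-ins for the launcher events in ApplicationMasterLauncher#masterEvents and for the RM's view of node state. The sequence-number tie-breaker is also an assumption of the sketch; it is there because PriorityBlockingQueue does not guarantee FIFO order among elements of equal priority.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

class AmLauncherSketch {
    // Hypothetical event types; declaration order makes LAUNCH sort before CLEANUP.
    enum EventType { LAUNCH, CLEANUP }

    // Hypothetical stand-in for a queued launcher operation.
    static final class Event implements Comparable<Event> {
        private static final AtomicLong SEQ = new AtomicLong();
        final EventType type;
        final String nodeId;
        // Tie-breaker that keeps FIFO order among events of the same type.
        final long seq = SEQ.getAndIncrement();

        Event(EventType type, String nodeId) {
            this.type = type;
            this.nodeId = nodeId;
        }

        @Override
        public int compareTo(Event other) {
            int byType = type.compareTo(other.type);
            return byType != 0 ? byType : Long.compare(seq, other.seq);
        }
    }

    // Processes all queued events and records what would be done with each one.
    // usableNodes is an assumed snapshot of the nodes the RM still considers reachable.
    static List<String> drain(PriorityBlockingQueue<Event> queue, Set<String> usableNodes) {
        List<String> log = new ArrayList<>();
        Event e;
        while ((e = queue.poll()) != null) {
            if (e.type == EventType.CLEANUP && !usableNodes.contains(e.nodeId)) {
                // Proposal 2: do not contact a node that is already known to be lost.
                log.add("SKIP_CLEANUP:" + e.nodeId);
            } else {
                log.add(e.type + ":" + e.nodeId);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<Event> queue = new PriorityBlockingQueue<>();
        queue.add(new Event(EventType.CLEANUP, "nm1")); // nm1 is lost
        queue.add(new Event(EventType.LAUNCH, "nm3"));
        queue.add(new Event(EventType.CLEANUP, "nm2"));
        queue.add(new Event(EventType.LAUNCH, "nm2"));
        // Proposal 1: both launches come out before any cleanup.
        System.out.println(drain(queue, Set.of("nm2", "nm3")));
    }
}
```

In this sketch the launches are handled first even though a cleanup was enqueued earlier, and the cleanup for the lost node is skipped without any attempt to connect. How the real launcher would obtain node state and what "unusable" should cover are left open here, as in the proposal itself.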