Tao Yang created YARN-9423:
------------------------------

             Summary: Optimize AM launcher to avoid bottleneck when a large 
number of AM failover happen at the same time
                 Key: YARN-9423
                 URL: https://issues.apache.org/jira/browse/YARN-9423
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.2.0
            Reporter: Tao Yang
            Assignee: Tao Yang


We have seen slow application recovery when many NMs are lost at the same 
time:
 # Many NMs shut down abnormally at the same time.
 # The NMs expire, and a large number of AMs start to fail over.
 # AM containers are allocated but not launched for about half an hour.

During this slow recovery, all ApplicationMasterLauncher threads were running 
cleanup for containers on the lost nodes and kept retrying to communicate 
with the NMs for 3 minutes (the retry policy is configured in 
NMProxy#createNMProxy), even though the RM already knew these NMs were lost and 
probably could not be reached for a long time. Meanwhile, many AM cleanup and 
launch operations were still waiting in the queue 
(ApplicationMasterLauncher#masterEvents). AM launch operations were therefore 
blocked behind cleanup operations, each of which wasted 3 minutes on retries. 
As a result, AM failover can be very slow.

I think we can optimize the AM launcher in two ways:
 # Change the type of ApplicationMasterLauncher#masterEvents from 
LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are 
always executed ahead of cleanup operations.
 # In the cleanup process (AMLauncher#cleanup), check the node state before 
communicating with the NM, and skip cleanup of AM containers on non-existent or 
unusable NMs (because these NMs probably cannot be reached for a long time).
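
The two proposals can be sketched together as follows. This is a minimal, 
standalone illustration and not the actual patch: AmEvent, EventType, drain, 
and the usableNodes set are hypothetical stand-ins for the runnables held in 
ApplicationMasterLauncher#masterEvents and for the RM's view of node state. 
The sequence number is an assumption added to keep FIFO order among events of 
the same type, since PriorityBlockingQueue does not order equal elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class AmLauncherSketch {
    // Hypothetical event kinds; LAUNCH is declared first so it sorts ahead of CLEANUP.
    enum EventType { LAUNCH, CLEANUP }

    // Hypothetical stand-in for an entry in the launcher queue.
    static final class AmEvent implements Comparable<AmEvent> {
        private static final AtomicLong SEQ = new AtomicLong();
        final EventType type;
        final String nodeId;
        // Tie-breaker that preserves arrival order among events of the same type.
        final long seq = SEQ.getAndIncrement();

        AmEvent(EventType type, String nodeId) {
            this.type = type;
            this.nodeId = nodeId;
        }

        @Override
        public int compareTo(AmEvent other) {
            int byType = type.compareTo(other.type);
            return byType != 0 ? byType : Long.compare(seq, other.seq);
        }
    }

    // Drains the queue in priority order. Cleanup for a node that is not in
    // usableNodes is skipped instead of being sent to the (unreachable) NM.
    static List<String> drain(PriorityBlockingQueue<AmEvent> queue,
                              Set<String> usableNodes) {
        List<String> handled = new ArrayList<>();
        AmEvent event;
        while ((event = queue.poll()) != null) {
            if (event.type == EventType.CLEANUP
                    && !usableNodes.contains(event.nodeId)) {
                handled.add("SKIP_CLEANUP " + event.nodeId);
            } else {
                handled.add(event.type + " " + event.nodeId);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<AmEvent> queue = new PriorityBlockingQueue<>();
        queue.add(new AmEvent(EventType.CLEANUP, "nm1")); // nm1 is lost
        queue.add(new AmEvent(EventType.CLEANUP, "nm2"));
        queue.add(new AmEvent(EventType.LAUNCH, "nm3"));  // arrives last, runs first
        Set<String> usableNodes = new HashSet<>(Arrays.asList("nm2", "nm3"));
        // prints [LAUNCH nm3, SKIP_CLEANUP nm1, CLEANUP nm2]
        System.out.println(drain(queue, usableNodes));
    }
}
```

In this sketch the launch event is handled first even though it was queued 
last, and the cleanup for the lost node returns immediately instead of holding 
a launcher thread for the full retry period.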




