[ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140526#comment-15140526 ]
Rohith Sharma K S commented on YARN-4685:
-----------------------------------------

One case where the application got stuck is:
# The cluster started with 2 nodes initially, and 1 application was submitted.
# Attempt 1 failed because of a disk failure on NM-1. Attempt 2 was created with NM-1 marked as a blacklisted node.
# NM-2 was removed from the cluster, so only NM-1 remained.
# Since NM-1 is blacklisted, no more containers are assigned to NM-1.
# The cluster has only 1 node, and that node is blacklisted, so no containers are assigned to NM-1 even after NM-1 reconnects once its disk space has been freed.

> AM blacklist addition/removal should get updated for every allocate call from RMAppAttemptImpl.
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4685
>                 URL: https://issues.apache.org/jira/browse/YARN-4685
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>
> AM blacklist additions and removals are updated only when the RMAppAttempt is scheduled, i.e. in {{RMAppAttemptImpl#ScheduleTransition#transition}}. Once the attempt is scheduled, any removeNode/addNode in the cluster is not propagated to {{BlacklistManager#refreshNodeHostCount}}. As a result, BlacklistManager operates on a stale NM count, and the application stays in the ACCEPTED state and waits forever even if more nodes are added to the cluster.
> The solution is to update BlacklistManager on every {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This ensures that any addition or removal of nodes is reflected in BlacklistManager.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
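
For illustration, the stale-count scenario from the comment above can be sketched roughly as follows. This is a minimal, self-contained sketch and not the Hadoop code: the class {{SketchBlacklistManager}}, its {{activeBlacklist}} method, and the 0.8 disable threshold are assumptions made for the example. The sketch also assumes that blacklisting is switched off once the blacklisted hosts reach a given fraction of the known node count. Only the names {{addNode}} and {{refreshNodeHostCount}} are borrowed from the issue text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StaleBlacklistSketch {

    /** Hypothetical stand-in for a blacklist manager; NOT the Hadoop class. */
    static class SketchBlacklistManager {
        private final Set<String> blacklisted = new LinkedHashSet<>();
        private final double disableThreshold; // assumed fraction of the cluster
        private int nodeHostCount;             // node count as last reported

        SketchBlacklistManager(int nodeHostCount, double disableThreshold) {
            this.nodeHostCount = nodeHostCount;
            this.disableThreshold = disableThreshold;
        }

        void addNode(String host) {
            blacklisted.add(host);
        }

        void refreshNodeHostCount(int count) {
            this.nodeHostCount = count;
        }

        /**
         * Returns the hosts to avoid. Assumption: once the blacklisted hosts
         * reach disableThreshold * nodeHostCount, blacklisting is disabled and
         * an empty list is returned so the application is not starved.
         */
        List<String> activeBlacklist() {
            if (blacklisted.size() < disableThreshold * nodeHostCount) {
                return new ArrayList<>(blacklisted);
            }
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // Steps 1-2: two nodes in the cluster, attempt 1 fails on NM-1.
        SketchBlacklistManager manager = new SketchBlacklistManager(2, 0.8);
        manager.addNode("NM-1");

        // Step 3: NM-2 leaves, but the manager is never told (stale count of 2).
        System.out.println("stale count:     " + manager.activeBlacklist());

        // Proposed behaviour: refresh the count on every allocation.
        manager.refreshNodeHostCount(1);
        System.out.println("refreshed count: " + manager.activeBlacklist());
    }
}
```

With the stale count of 2, NM-1 stays on the active blacklist although it is the only node left. After the count is refreshed to 1, the sketch's threshold check disables blacklisting and NM-1 becomes usable again.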