[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140295#comment-16140295 ]
Arun Suresh edited comment on YARN-7086 at 8/24/17 4:51 PM:
------------------------------------------------------------

Thanks for chiming in, folks. And yes, I agree with [~jlowe] too.

To move forward, and if everyone is fine with the approach, I will post a patch that does the following:
* Introduce a *RELEASE_CONTAINERS* scheduler event: this will refactor the existing RELEASE_CONTAINER event to take multiple containers.
* Expose an async release method in the AbstractYarnScheduler that takes a list of containers, splits the list into sub-lists of at most some (configured?) maximum number of containers released at a time, and sends an event for each sub-list.
* Route all calls to release containers from the scheduler to the new API. Currently, the problematic ones occur during app attempt completion, node removal, and the scheduler's handling of the AM's explicit container releases.
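To make the second bullet concrete, here is a minimal sketch of the split-and-dispatch step, not the actual patch. All names in it (ReleaseBatcher, ReleaseContainersEvent, asyncRelease, maxPerEvent) are hypothetical, container IDs are plain strings, and the dispatcher is a simple Consumer standing in for the scheduler's event queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class ReleaseBatcher {

  /** Hypothetical stand-in for a RELEASE_CONTAINERS scheduler event. */
  static final class ReleaseContainersEvent {
    final List<String> containerIds;

    ReleaseContainersEvent(List<String> containerIds) {
      this.containerIds = containerIds;
    }
  }

  /** Splits ids into consecutive sub-lists of at most maxPerEvent entries. */
  static List<List<String>> partition(List<String> ids, int maxPerEvent) {
    if (maxPerEvent <= 0) {
      throw new IllegalArgumentException("maxPerEvent must be positive");
    }
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < ids.size(); start += maxPerEvent) {
      int end = Math.min(start + maxPerEvent, ids.size());
      // Copy the view so each event owns its own list.
      batches.add(new ArrayList<>(ids.subList(start, end)));
    }
    return batches;
  }

  /**
   * Hands one event per sub-list to the dispatcher and returns the number
   * of events sent. The caller does not wait for the releases to happen.
   */
  static int asyncRelease(List<String> ids, int maxPerEvent,
      Consumer<ReleaseContainersEvent> dispatcher) {
    int sent = 0;
    for (List<String> batch : partition(ids, maxPerEvent)) {
      dispatcher.accept(new ReleaseContainersEvent(batch));
      sent++;
    }
    return sent;
  }

  public static void main(String[] args) {
    // A plain list plays the role of the scheduler's event queue here.
    List<ReleaseContainersEvent> queue = new ArrayList<>();
    int sent = asyncRelease(
        Arrays.asList("c1", "c2", "c3", "c4", "c5"), 2, queue::add);
    System.out.println(sent + " events queued");
  }
}
```

Capping the size of each event means a single event-handling pass only ever releases a bounded number of containers, which is the motivation for the (configured?) maximum in the proposal.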

> Release all containers asynchronously
> -------------------------------------
>
>                 Key: YARN-7086
>                 URL: https://issues.apache.org/jira/browse/YARN-7086
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>
> We have noticed in production two situations that can cause deadlocks and
> cause scheduling of new containers to come to a halt, especially with regard
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the
> scheduler releases all their live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make
> sure ALL container releases happen asynchronously, and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases temp containers which it creates for the update)