[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590507#comment-16590507
 ] 

Manikandan R edited comment on YARN-7086 at 8/23/18 4:49 PM:
-------------------------------------------------------------

Thanks [~asuresh]

Attached .001 patch for early review. It has changes as described in 
https://issues.apache.org/jira/browse/YARN-7086?focusedCommentId=16140295&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16140295.

[~jlowe]

{quote}I think it would be a lot better if there was a bulk-release interface 
so we could grab the critical lock once.{quote}

I assume you are referring to the lock inside LeafQueue#completedContainer(). If 
so, one approach would be to change 
Scheduler#completedContainer(), Scheduler#completedContainerInternal() and 
LeafQueue#completedContainer() to accept a list of containers and process 
them together, rather than accepting a single container. Currently, all of these 
methods accept a single RMContainer and operate on that container alone. 
With this approach, we would need to work out how each method accepts the list and 
traverses it. Can you please confirm this?
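
To make the question concrete, here is a minimal sketch of the difference between per-container and bulk release. It is an illustration under assumptions, not the patch: SketchQueue, completedContainers() and the String container ids are hypothetical stand-ins for LeafQueue and RMContainer, and the counter exists only to show how often the lock is taken.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a leaf queue; this is NOT the real LeafQueue class. */
class SketchQueue {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> released = new ArrayList<>();
    private int lockAcquisitions = 0;

    /** Current shape: the lock is taken once for every container. */
    void completedContainer(String containerId) {
        lock.lock();
        try {
            lockAcquisitions++;
            releaseLocked(containerId);
        } finally {
            lock.unlock();
        }
    }

    /** Bulk shape (hypothetical method name): the lock is taken once for the whole list. */
    void completedContainers(List<String> containerIds) {
        lock.lock();
        try {
            lockAcquisitions++;
            for (String id : containerIds) {
                releaseLocked(id);
            }
        } finally {
            lock.unlock();
        }
    }

    /** Bookkeeping that must only run while the lock is held. */
    private void releaseLocked(String containerId) {
        released.add(containerId);
    }

    int lockAcquisitions() {
        return lockAcquisitions;
    }

    int releasedCount() {
        return released.size();
    }
}

public class BulkReleaseSketch {
    public static void main(String[] args) {
        List<String> ids = Arrays.asList("c1", "c2", "c3");

        SketchQueue perContainer = new SketchQueue();
        for (String id : ids) {
            perContainer.completedContainer(id);
        }

        SketchQueue bulk = new SketchQueue();
        bulk.completedContainers(ids);

        System.out.println("per-container lock acquisitions: " + perContainer.lockAcquisitions());
        System.out.println("bulk lock acquisitions: " + bulk.lockAcquisitions());
    }
}
```

In this sketch both paths release the same containers; only the number of lock acquisitions differs, which is the point of the bulk-release suggestion as I understand it.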


> Release all containers asynchronously
> -------------------------------------
>
>                 Key: YARN-7086
>                 URL: https://issues.apache.org/jira/browse/YARN-7086
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Arun Suresh
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: YARN-7086.001.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> cause scheduling of new containers to come to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release their containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc 
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases the temp containers that it creates for the update).
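
The asynchronous release described above can be pictured with a small sketch. This is an assumption-laden illustration, not the production patch and not the YARN-6251 code: AsyncReleaseSketch, enqueueRelease() and drain() are hypothetical names, and a single-threaded executor stands in for the scheduler's event dispatcher.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical scheduler stand-in; NOT AbstractYarnScheduler. */
class AsyncReleaseSketch {
    // One worker thread plays the role of the scheduler event dispatcher.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final List<String> released = new CopyOnWriteArrayList<>();

    /** Queues a release and returns immediately; the caller never waits on the scheduler lock. */
    void enqueueRelease(String containerId) {
        dispatcher.execute(() -> handleRelease(containerId));
    }

    /** Runs on the dispatcher thread; the monitor stands in for the scheduler lock. */
    private synchronized void handleRelease(String containerId) {
        released.add(containerId);
    }

    /** Stops accepting new releases and waits for queued ones; returns true if all were processed in time. */
    boolean drain(long timeoutMillis) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    List<String> released() {
        return released;
    }
}

public class AsyncReleaseDemo {
    public static void main(String[] args) {
        AsyncReleaseSketch scheduler = new AsyncReleaseSketch();
        for (int i = 0; i < 5; i++) {
            scheduler.enqueueRelease("container_" + i);
        }
        boolean finished = scheduler.drain(5000);
        System.out.println("all releases processed: " + finished);
        System.out.println("released count: " + scheduler.released().size());
    }
}
```

Because the executor has a single worker, the releases in this sketch are applied in the order they were queued, while the thread that requested them is free to continue.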



