[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Naganarasimha G R updated YARN-6029:
------------------------------------
    Priority: Blocker  (was: Major)

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to
> release a reserved container
> ----------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Blocker
>         Attachments: YARN-6029.001.patch, deadlock.jstack
>
>
> The ResourceManager can deadlock when ParentQueue#getQueueUserAclInfo is
> called (e.g. a client calls YarnClient#getQueueAclsInfo) while
> LeafQueue#assignContainers is running and has not yet notified the parent
> queue to release a resource (it should release a reserved container).
> I found this problem in our testing environment for Hadoop 2.8.
>
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls the synchronized
> LeafQueue#assignContainers and acquires the LeafQueue instance lock of
> queue root.a.
> * 2. Thread B (IPC Server handler) calls the synchronized
> ParentQueue#getQueueUserAclInfo and acquires the ParentQueue instance lock
> of queue root. It iterates over the child queues' ACLs and blocks when
> calling the synchronized LeafQueue#getQueueUserAclInfo, because the
> LeafQueue instance lock of queue root.a is held by Thread A.
> * 3. Thread A wants to inform the parent queue that a container is being
> completed and blocks when invoking the synchronized
> ParentQueue#internalReleaseResource method, because the ParentQueue
> instance lock of queue root is held by Thread B.
>
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be
> removed to solve this problem, since this method does not appear to modify
> any fields of the LeafQueue instance.
> Attached a patch with a unit test for review.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
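
The lock ordering described in the report can be illustrated with a minimal Java sketch. It is not the actual CapacityScheduler code: the classes Parent and Leaf, the method runInterleaving, and the two latches are hypothetical stand-ins that only mirror the method names from the description. The sketch assumes the proposed fix (an unsynchronized Leaf#getQueueUserAclInfo) and forces the interleaving from steps 1 to 3, so that the ACL query runs while the scheduler thread still holds the leaf lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-ins for ParentQueue and LeafQueue; not the real YARN classes.
class QueueLockOrderSketch {

    static class Parent {
        final List<Leaf> children = new ArrayList<>();
        private int released = 0;

        // Holds the parent lock while visiting every child, as in step 2.
        synchronized List<String> getQueueUserAclInfo() {
            List<String> acls = new ArrayList<>();
            acls.add("root");
            for (Leaf child : children) {
                acls.add(child.getQueueUserAclInfo());
            }
            return acls;
        }

        // Needs the parent lock, as in step 3.
        synchronized void internalReleaseResource() {
            released++;
        }

        synchronized int releasedCount() {
            return released;
        }
    }

    static class Leaf {
        private final String name;
        private final Parent parent;

        Leaf(String name, Parent parent) {
            this.name = name;
            this.parent = parent;
        }

        // Proposed fix: not synchronized. In this sketch it only reads a final field.
        String getQueueUserAclInfo() {
            return name;
        }

        // Holds the leaf lock, then calls up into the parent, as in steps 1 and 3.
        synchronized void assignContainers(CountDownLatch leafLockHeld,
                                           CountDownLatch aclQueryDone)
                throws InterruptedException {
            leafLockHeld.countDown();
            // Bounded wait so the ACL query runs while this thread holds the leaf lock.
            aclQueryDone.await(2, TimeUnit.SECONDS);
            parent.internalReleaseResource();
        }
    }

    // Returns true if both the ACL query and the scheduler thread finish.
    static boolean runInterleaving() {
        Parent root = new Parent();
        Leaf leaf = new Leaf("root.a", root);
        root.children.add(leaf);
        CountDownLatch leafLockHeld = new CountDownLatch(1);
        CountDownLatch aclQueryDone = new CountDownLatch(1);

        // Plays the role of Thread A (scheduler event processor).
        Thread scheduler = new Thread(() -> {
            try {
                leaf.assignContainers(leafLockHeld, aclQueryDone);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        scheduler.setDaemon(true);
        scheduler.start();

        try {
            leafLockHeld.await();
            // The calling thread plays the role of Thread B (IPC handler).
            List<String> acls = root.getQueueUserAclInfo();
            aclQueryDone.countDown();
            scheduler.join(5000);
            return !scheduler.isAlive() && acls.size() == 2 && root.releasedCount() == 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("completed without deadlock: " + runInterleaving());
    }
}
```

If Leaf#getQueueUserAclInfo in this sketch were declared synchronized, the calling thread would block on the leaf lock while holding the parent lock, the scheduler thread's bounded wait would expire, and it would then block on the parent lock. That is the same cycle as in the report. Whether dropping the modifier is safe in the real LeafQueue depends on what the real method reads, which the sketch does not model.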