[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840457#comment-17840457 ]
ASF GitHub Bot commented on YARN-11191:
---------------------------------------

tomicooler opened a new pull request, #6769:
URL: https://github.com/apache/hadoop/pull/6769

   … (#6732)

   (cherry picked from commit ecf665c6facf89d3b87b6e3cc684274b8155ca60)
   Change-Id: I561bcad51af7810328c8b91cd9290d5198be0c6e

   ### Description of PR

   Backport. There were conflicts (abstract parent/leaf queue, and queuepath doesn't exist here yet).

   Jira: [YARN-11191](https://issues.apache.org/jira/browse/YARN-11191)
   Original PR: #6732

   ### For code changes:

   - [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

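   For context, the issue quoted below hinges on one behaviour of a non-fair read/write lock: a new reader is held back when a writer is first in the wait queue, even though the lock is currently held only by readers. The following is a minimal sketch of that single rule using `java.util.concurrent.locks.ReentrantReadWriteLock`, assuming the OpenJDK implementation. It is not the scheduler code, and the class and method names are hypothetical. It uses a timed `tryLock` so the demo finishes instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Hadoop.
class QueuedWriterDemo {

    // Returns true if a second reader obtains the read lock while a first
    // reader holds it, optionally with a writer already waiting in the queue.
    static boolean secondReaderGetsLock(boolean writerQueued) {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default
        CountDownLatch firstReaderHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseFirstReader = new CountDownLatch(1);

        // Plays the role of the thread that already holds csqueue.readLock.
        Thread firstReader = new Thread(() -> {
            lock.readLock().lock();
            try {
                firstReaderHoldsLock.countDown();
                releaseFirstReader.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        // Plays the role of the "other thread" that requires csqueue.writeLock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });

        try {
            firstReader.start();
            firstReaderHoldsLock.await();

            if (writerQueued) {
                writer.start();
                // Wait until the writer is enqueued and parked behind the reader.
                while (!lock.hasQueuedThread(writer)
                        || writer.getState() != Thread.State.WAITING) {
                    Thread.sleep(1);
                }
            }

            // Second reader (this thread). The timed tryLock follows the normal
            // acquisition path, so it waits behind a queued writer.
            boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.readLock().unlock();
            }

            // Clean up: let the first reader go so the writer can finish.
            releaseFirstReader.countDown();
            firstReader.join();
            if (writerQueued) {
                writer.join();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("no writer queued: " + secondReaderGetsLock(false));
        System.out.println("writer queued:    " + secondReaderGetsLock(true));
    }
}
```

   In the reported scenario the second reader would be the refresh thread, which uses a plain blocking `lock()` while holding the PreemptionManager write lock, so the wait never times out and the cycle closes.
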
> Global Scheduler refreshQueue cause deadLock
> ---------------------------------------------
>
>                 Key: YARN-11191
>                 URL: https://issues.apache.org/jira/browse/YARN-11191
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
>            Reporter: ben yang
>            Assignee: Tamas Domok
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch
>
>
> This is a potential bug that may affect every cluster with preemption enabled.
> In our current version, with preemption enabled, the CapacityScheduler calls the
> refreshQueue method of the PreemptionManager whenever it refreshes its queues. This
> process holds the PreemptionManager write lock and requires the csqueue read
> lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock
> and requires the PreemptionManager read lock.
> A deadlock is possible at this point, because under the non-fair policy the read
> lock follows one rule: when the lock is already held by a reader and the
> first request in the lock's wait queue is a write lock request, other
> read lock requests cannot acquire the lock.
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
>                                 require: csqueue.readLock
> CapacityScheduler.schedule: hold: csqueue.readLock
>                             require: PreemptionManager.readLock
> other thread (completeContainer, release resource, etc.): require: csqueue.writeLock
> {code}
> The jstack logs at the time were as follows



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org