[ https://issues.apache.org/jira/browse/YARN-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578851#comment-17578851 ]

ASF GitHub Bot commented on YARN-11191:
---------------------------------------

luoyuan3471 commented on PR #4726:
URL: https://github.com/apache/hadoop/pull/4726#issuecomment-1212840341

   > @luoyuan3471 1. The key to the deadlock is that the refresh thread can't 
acquire the csqueue read lock: the read-lock request is blocked by a queued 
write-lock request (see https://bugs.openjdk.org/browse/JDK-6893626), so I use 
tryLock to break that condition. The PreemptionManager lock will be released 
soon after the refresh thread gets the csqueue read lock. 2. Just preemption, 
but the global scheduler increases the chance.
   
   For 2, why does the Global Scheduler increase the chance of this deadlock case?
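
   A minimal sketch of the tryLock idea from point 1 above, as I understand it 
(illustration only, not the actual PR change; the class name, lock fields, and 
back-off loop are assumptions standing in for the real csqueue and 
PreemptionManager locks):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TryLockRefreshSketch {
    // Stand-ins for the real csqueue and PreemptionManager locks (assumption).
    private final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();

    // Refresh path: holds the PreemptionManager write lock, then needs the
    // csqueue read lock.
    public void refreshQueues() throws InterruptedException {
        preemptionLock.writeLock().lock();
        try {
            // The untimed readLock().tryLock() barges past a queued writer: it
            // succeeds whenever no thread holds the write lock, so the refresh
            // thread can no longer park forever behind a pending csqueue
            // write-lock request while holding the PreemptionManager write lock.
            while (!queueLock.readLock().tryLock()) {
                TimeUnit.MILLISECONDS.sleep(10); // back off briefly and retry
            }
            try {
                // ... rebuild preemption state from the queue hierarchy ...
            } finally {
                queueLock.readLock().unlock();
            }
        } finally {
            preemptionLock.writeLock().unlock();
        }
    }
}
{code}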




> Global Scheduler refreshQueue cause deadLock 
> ---------------------------------------------
>
>                 Key: YARN-11191
>                 URL: https://issues.apache.org/jira/browse/YARN-11191
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
>            Reporter: ben yang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 1.jstack, Lock holding status.png, YARN-11191.001.patch
>
>
> This is a potential bug that may impact all clusters with preemption enabled. 
> In our current version with preemption enabled, the CapacityScheduler will 
> call the refreshQueue method of the PreemptionManager when it refreshes 
> queues. This process holds the PreemptionManager write lock and requires the 
> csqueue read lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the 
> csqueue read lock and requires the PreemptionManager read lock.
> There is a possibility of deadlock at this time, because the read lock has one 
> rule under the non-fair policy: when the lock is already held by a read lock 
> and the first request in the lock's wait queue is a write-lock request, other 
> read-lock requests cannot acquire the lock (a standalone sketch of this 
> behavior follows the quoted description below).
> So the potential deadlock is:
> {code:java}
> CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
>                                 require: csqueue.readLock
> CapacityScheduler.schedule: hold: csqueue.readLock
>                             require: PreemptionManager.readLock
> other thread (completeContainer, releaseResource, etc.): require: csqueue.writeLock
> {code}
> The jstack logs at the time are in the attached 1.jstack.
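
For reference, a minimal standalone sketch of the read-lock rule described in 
the quoted description above (see https://bugs.openjdk.org/browse/JDK-6893626): 
with a non-fair ReentrantReadWriteLock, once a write-lock request is queued, a 
new read-lock request parks behind it even though the lock is only read-held. 
The class name, thread names, and timings are illustrative assumptions, not 
Hadoop code:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksReaderDemo {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default

        lock.readLock().lock(); // thread 1 (like the scheduler): long-lived read hold

        Thread writer = new Thread(() -> lock.writeLock().lock(), "writer");
        writer.setDaemon(true);
        writer.start();                   // thread 2: write request queues up behind the reader
        TimeUnit.MILLISECONDS.sleep(200); // give the writer time to enter the wait queue

        Thread reader = new Thread(() -> {
            lock.readLock().lock();       // thread 3 (like refreshQueue): parks behind the queued writer
            System.out.println("second reader acquired the read lock");
        }, "reader");
        reader.setDaemon(true);
        reader.start();
        reader.join(1000);

        // Prints true: the second read-lock request is stuck even though the
        // lock is only read-held, which is the condition behind the deadlock.
        System.out.println("second reader still waiting: " + reader.isAlive());
    }
}
{code}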


