[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383469#comment-14383469 ]
Peng Zhang commented on YARN-3405:
----------------------------------

There is another related case that causes a live lock between preemption and scheduling. If necessary, I will create a separate issue for it.

Queue hierarchy is described as below:
{noformat}
                 root
          /       |       \
     queue-1   queue-2   queue-3
     /     \
queue-1-1  queue-1-2
{noformat}

# Assume the cluster resource is 100G of memory.
# Assume queue-1 has a max resource limit of 20G.
# queue-1-1 becomes active and gets at most 20G of memory (equal to its fairshare).
# queue-2 then becomes active and requires 30G of memory (less than its fairshare).
# queue-3 becomes active and can be assigned all the remaining resources, 50G of memory (larger than its fairshare).
# queue-1-2 becomes active and causes a new preemption request (10G of memory; intuitively it can only preempt from its sibling queue-1-1).
# Preemption actually starts from root, finds that queue-3 is the most over its fairshare, and preempts some resources from queue-3.
# During scheduling, queue-1 is found to have already reached its max resource limit, so no resource can be assigned to it. The resource is then assigned to queue-3 again.

The last two steps then repeat indefinitely.

> FairScheduler's preemption cannot happen between sibling in some case
> ---------------------------------------------------------------------
>
>                 Key: YARN-3405
>                 URL: https://issues.apache.org/jira/browse/YARN-3405
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.0
>            Reporter: Peng Zhang
>            Priority: Critical
>
> Queue hierarchy described as below:
> {noformat}
>           root
>            |
>         queue-1
>        /       \
>  queue-1-1   queue-1-2
> {noformat}
> 1. queue-1-1 is active and has been assigned all resources.
> 2. queue-1-2 becomes active and causes a new preemption request.
> 3. Preemption now starts from root and finds that queue-1 is not
> over its fairshare, so it does not recurse into queue-1-1.
> 4. As a result, queue-1-2 keeps waiting for queue-1-1 to release
> resources by itself.
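
The loop in the comment's scenario (preempt from queue-3, fail to assign to queue-1, give the memory back to queue-3) can be illustrated with a toy model. This is only a sketch and not FairScheduler code: the function names are hypothetical, the fair shares of 20G/40G/40G for queue-1/queue-2/queue-3 are an assumption (the comment only says queue-2 is below and queue-3 is above its fairshare), and each round is assumed to preempt the full 10G request.

```python
# Toy model of the preemption/scheduling live lock; NOT FairScheduler code.
QUEUE1_MAX_GB = 20  # max resource limit of queue-1 (from the scenario)

# Assumed fair shares of root's children (not stated in the comment).
FAIR_SHARE_GB = {"queue-1": 20, "queue-2": 40, "queue-3": 40}


def top_level_usage(usage):
    """Aggregate leaf usage into the three children of root."""
    return {
        "queue-1": usage["queue-1-1"] + usage["queue-1-2"],
        "queue-2": usage["queue-2"],
        "queue-3": usage["queue-3"],
    }


def preempt_from_root(usage, request_gb):
    """Free up to request_gb from the root child most over its fair share."""
    top = top_level_usage(usage)
    victim = max(top, key=lambda q: top[q] - FAIR_SHARE_GB[q])
    over = top[victim] - FAIR_SHARE_GB[victim]
    freed = max(0, min(request_gb, over))
    # Only queue-1 has children; descend into queue-1-1 in that case.
    leaf = "queue-1-1" if victim == "queue-1" else victim
    usage[leaf] -= freed
    return victim, freed


def schedule(usage, free_gb, request_gb):
    """Offer freed memory to queue-1-2 first, then hand the rest to queue-3."""
    # queue-1-2 can only grow while its parent stays under the 20G limit.
    room = QUEUE1_MAX_GB - top_level_usage(usage)["queue-1"]
    granted = max(0, min(free_gb, request_gb, room))
    usage["queue-1-2"] += granted
    # Assumption: queue-3 still has pending demand and absorbs what is left.
    usage["queue-3"] += free_gb - granted
    return granted


def simulate(rounds, request_gb=10):
    """Run preemption followed by scheduling for the given number of rounds."""
    # State after queue-1-1, queue-2 and queue-3 have become active.
    usage = {"queue-1-1": 20, "queue-1-2": 0, "queue-2": 30, "queue-3": 50}
    history = []
    for _ in range(rounds):
        victim, freed = preempt_from_root(usage, request_gb)
        granted = schedule(usage, freed, request_gb)
        history.append((victim, freed, granted))
    return history, usage
```

Under these assumptions every round picks queue-3 as the victim, frees 10G and grants 0G to queue-1-2, so the usage map is the same after each round as before it. That unchanged state is the live lock: queue-1-2 can only make progress if the memory is taken from its sibling queue-1-1.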
-- This message was sent by Atlassian JIRA (v6.3.4#6332)