[ https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573997#comment-16573997 ]
Zian Chen edited comment on YARN-8509 at 8/10/18 9:21 PM:
----------------------------------------------------------

Hi Eric, thanks for the comments.

I discussed this with Wangda: the patch uploaded earlier is not correct because it was based on a misunderstanding of the original problem. I have changed the Jira title. The intention of this Jira is to fix how pending resources are calculated with respect to the user limit in the preemption scenario.

Currently, the pending resource calculation in preemption reuses the user-limit algorithm from scheduling, which is:
{code:java}
user_limit = min(max(current_capacity / #active_users, current_capacity * user_limit_percent), queue_capacity * user_limit_factor)
{code}
This is good for scheduling, because we want to make sure each user can get at least "minimum-user-limit-percent" of the resources, which acts more like a lower bound on the user limit.

However, we should not use minimum-user-limit-percent to capture the total pending resource a leaf queue can get. Instead, we want to use user-limit-factor, which is the upper bound, to capture pending resource in preemption. If we use minimum-user-limit-percent to capture pending resource, resources will be under-utilized in the preemption scenario. Thus, we suggest that the pending resource calculation for preemption use this formula:
{code:java}
total_pending(partition, queue) = min { Q_max(partition) - Q_used(partition), Σ (min { User.ulf(partition) - User.used(partition), User.pending(partition) }) }
{code}
Let me give an example:
{code:java}
          Root
      /  |  \  \
     a   b   c   d
    30  30  30  10

1) Only one node (n1) in the cluster; it has 100G.
2) app1 is submitted to queue-a: 10G used, 6G pending.
3) app2 is submitted to queue-b: 40G used, 30G pending.
4) app3 is submitted to queue-c: 50G used, 30G pending.
{code}
Here we have only one user, and the user-limit settings for the queues are:
||Queue name||minimum-user-limit-percent||user-limit-factor||
|a|50|1.0f|
|b|50|3.0f|
|c|50|3.0f|
|d|50|2.0f|

With the old calculation, the user limit for queue-a is 30G, which lets app1 keep its 6G pending. But the user limit for queue-b becomes 40G, so the headroom becomes zero after subtracting the 40G used, and the 30G of pending resource being asked for cannot be accepted. The same thing happens with queue-c.

However, if we look at this test case from the preemption point of view, we should allow queue-b and queue-c to take more pending resources, because even though queue-a has 30G of guaranteed capacity configured, it is under-utilized. With the pending resource captured by the old algorithm, queue-b and queue-c cannot take the available resource through preemption, which means the cluster resources are not used effectively.

To summarize, since user-limit-factor is the hard limit on how much resource a user can use, we should calculate pending resource based on user-limit-factor instead of minimum-user-limit-percent.

Could you share your opinion on this, [~eepayne]?

was (Author: zian chen):
Hi Eric, thanks for the comments.

I discussed this with Wangda: the patch uploaded earlier is not correct because it was based on a misunderstanding of the original problem. I have changed the Jira title. The intention of this Jira is to fix how pending resources are calculated with respect to the user limit in the preemption scenario.

Currently, the pending resource calculation in preemption reuses the user-limit algorithm from scheduling, which is:
{code:java}
user_limit = min(max(current_capacity / #active_users, current_capacity * user_limit_percent), queue_capacity * user_limit_factor)
{code}
This is good for scheduling, because we want to make sure each user can get at least "minimum-user-limit-percent" of the resources, which acts more like a lower bound on the user limit.
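
As a rough illustration of the scheduling formula quoted above, here is a minimal Python sketch. The function name `user_limit` and its parameters are made up for this example and are not the scheduler's API. It also assumes that `current_capacity` is the larger of the queue's guaranteed capacity and its current usage, which is what reproduces the 30G and 40G limits mentioned in the comment; the values are the ones from the edited comment's example (one user, minimum-user-limit-percent of 50).

```python
def user_limit(current_capacity, active_users, user_limit_percent,
               queue_capacity, user_limit_factor):
    """Scheduling-time user limit, all sizes in GB (illustrative only)."""
    fair_share = current_capacity / active_users
    min_share = current_capacity * user_limit_percent / 100.0
    hard_cap = queue_capacity * user_limit_factor
    return min(max(fair_share, min_share), hard_cap)


# Assumption: current_capacity = max(guaranteed capacity, used).
limit_a = user_limit(max(30, 10), 1, 50, 30, 1.0)  # queue-a: 30G
limit_b = user_limit(max(30, 40), 1, 50, 30, 3.0)  # queue-b: 40G
limit_c = user_limit(max(30, 50), 1, 50, 30, 3.0)  # queue-c: 50G

# Headroom is the user limit minus what the user already uses.
headroom_b = limit_b - 40  # zero, so the 30G pending is not counted
headroom_c = limit_c - 50  # zero as well
```

Under these assumptions the limit for queue-b and queue-c simply tracks current usage, so their headroom is zero even though user-limit-factor would allow up to 90G.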

However, we should not use minimum-user-limit-percent to capture the total pending resource a leaf queue can get. Instead, we want to use user-limit-factor, which is the upper bound, to capture pending resource in preemption. If we use minimum-user-limit-percent to capture pending resource, resources will be under-utilized in the preemption scenario. Thus, we suggest that the pending resource calculation for preemption use this formula:
{code:java}
total_pending(partition, queue) = min { Q_max(partition) - Q_used(partition), Σ (min { User.ulf(partition) - User.used(partition), User.pending(partition) }) }
{code}
Let me give an example:
{code:java}
          Root
      /  |  \  \
     a   b   c   d
    30  30  30  10

1) Only one node (n1) in the cluster; it has 100G.
2) app1 is submitted to queue-a: 10G used, 6G pending.
3) app2 is submitted to queue-b: 40G used, 30G pending.
4) app3 is submitted to queue-c: 50G used, 30G pending.
{code}
Here we have only one user, and the user-limit settings for the queues are:
||Queue name||minimum-user-limit-percent||user-limit-factor||
|a|1|1.0f|
|b|1|2.0f|
|c|1|2.0f|
|d|1|2.0f|

With the old calculation, the user limit for queue-a is 30G, which lets app1 keep its 6G pending. But the user limit for queue-b becomes 40G, so the headroom becomes zero after subtracting the 40G used, and the 30G of pending resource being asked for cannot be accepted. The same thing happens with queue-c.

However, if we look at this test case from the preemption point of view, we should allow queue-b and queue-c to take more pending resources, because even though queue-a has 30G of guaranteed capacity configured, it is under-utilized. With the pending resource captured by the old algorithm, queue-b and queue-c cannot take the available resource through preemption, which means the cluster resources are not used effectively.

To summarize, since user-limit-factor is the hard limit on how much resource a user can use, we should calculate pending resource based on user-limit-factor instead of minimum-user-limit-percent.
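
The proposed total_pending formula can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patch itself: `total_pending` is a hypothetical helper, `User.ulf` is taken to be the queue's guaranteed capacity times user-limit-factor, `Q_max` is assumed to be the whole 100G cluster for every queue (the example does not give a maximum capacity), and negative values are clamped to zero as a defensive choice. The main run uses the user-limit-factor values from the edited comment's table (1.0, 3.0, 3.0).

```python
def total_pending(q_max, q_used, users):
    """users: iterable of (ulf_cap, used, pending) tuples, sizes in GB."""
    per_user = sum(max(0, min(ulf_cap - used, pending))
                   for ulf_cap, used, pending in users)
    return max(0, min(q_max - q_used, per_user))


CLUSTER = 100  # assumed Q_max for every queue

# name: (guaranteed capacity, user-limit-factor, used, pending)
queues = {
    "a": (30, 1.0, 10, 6),
    "b": (30, 3.0, 40, 30),
    "c": (30, 3.0, 50, 30),
}

# One user per queue, as in the example.
pending = {
    name: total_pending(CLUSTER, used, [(cap * ulf, used, pend)])
    for name, (cap, ulf, used, pend) in queues.items()
}
# pending == {"a": 6, "b": 30, "c": 30}

# With the factor of 2.0 from the earlier revision of the table,
# queue-b would be capped at 60G - 40G = 20G of pending.
pending_b_old_factor = total_pending(CLUSTER, 40, [(30 * 2.0, 40, 30)])
```

With these assumptions queue-b and queue-c each report their full 30G as pending, instead of the zero headroom that the scheduling-time user limit gives them.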

Could you share your opinion on this, [~eepayne]?

> Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8509
>                 URL: https://issues.apache.org/jira/browse/YARN-8509
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Zian Chen
>            Assignee: Zian Chen
>            Priority: Major
>         Attachments: YARN-8509.001.patch, YARN-8509.002.patch, YARN-8509.003.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate total
> pending resource based on user-limit percent and user-limit factor, which
> caps the pending resource for each user at the minimum of the user-limit
> pending and the actual pending. This prevents a queue from taking more pending
> resource to achieve queue balance after all queues are satisfied with their
> ideal allocation.
>
> We need to change the logic so that queue pending can go beyond the user limit.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)