[jira] [Updated] (YARN-7619) Max AM Resource value in CS UI is different for every user
[ https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7619: - Attachment: YARN-7619.001.patch Uploading patch 001. This is not a perfect solution, but it's close. The pre-weighted AM limit for all users in a particular queue is calculated in {{LeafQueue#getUserAMResourceLimitPerPartition}} and passed to the UI via the {{UserInfo}} object for each user when the UI is rendered. This is a little awkward because the AM limit for users in the queue is a per-queue value, but when rendering, I wanted to multiply the value by each user's weight. The value displayed in the UI's Max AM Resource field may not always be valid for weighted users because it is not normalized, and it may exceed the queue-level AM limit if the weight is large. But since this is only for display purposes, I think it's acceptable. > Max AM Resource value in CS UI is different for every user > -- > > Key: YARN-7619 > URL: https://issues.apache.org/jira/browse/YARN-7619 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: Max AM Resources is Different for Each User.png, > YARN-7619.001.patch > > > YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity > scheduler UI used to contain the queue-level AM limit instead of the > user-level AM limit. It fixed this by using the user-specific AM limit that > is calculated in {{LeafQueue#activateApplications}}, stored in each user's > {{LeafQueue#User}} object, and retrieved via > {{UserInfo#getResourceUsageInfo}}. > The problem is that this user-specific AM limit depends on the activity of > other users and other applications in a queue, and it is only calculated and > updated when a user's application is activated. 
So, when > {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale > value unless an application was recently activated for a particular user. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7619) Max AM Resource value in CS UI is different for every user
[ https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282707#comment-16282707 ] Eric Payne commented on YARN-7619: -- All of the other solutions I could think of seem undesirable. One solution would be to have {{LeafQueue}} remember the last user for which it activated an application. The resource usages for that user are passed through the {{UserInfo}} object to {{CapacitySchedulerPage}}, which then extracts the last activated user's AM limit from those usages. This is not ideal, because it doesn't take into account user weights. So, if the last activated user has a weight not equal to 1.0, the AM limit may be wrong for some users. (_On a side note, user weights do not appear to affect user AM limits, even though {{LeafQueue#getUserAMResourceLimitPerPartition}} seems to be computing the limit using user weights_). Also, if the last activated user leaves the queue, we have to use each user's AM limit, which puts us back where we started. Another solution may be to have {{UsersManager}} sort the users list to be in last-activated-first order. Then, when {{CapacitySchedulerPage#QueueUsersInfoBlock}} is rendering the users info block, it could just get the user AM limit from the first user. That's what {{CapacitySchedulerPage#LeafQueueInfoBlock}} does when retrieving the value for *Max Application Master Resources Per User*. It just expects the first one to be the correct one for all the users in the queue. Ideally, I would say it would be best to save the recomputed user AM limit to all user objects whenever {{LeafQueue#getUserAMResourceLimitPerPartition}} is called, but that may cause a significant performance hit. Even so, I think this option is the cleanest, and the performance hit may not be that bad. 
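The display-time weighting discussed above can be sketched as follows. This is a hypothetical illustration, not the actual LeafQueue code: the method name, the MB units, and the capping rule are all assumptions (the patch itself leaves the weighted value uncapped, since it is display-only).

```java
// Hypothetical sketch of weighting the pre-weighted per-user AM limit for
// display. Names and the capping rule are assumptions, not the real code.
public class WeightedAmLimitSketch {
    /**
     * Scale the pre-weighted per-user AM limit by a user's weight. Without
     * normalization, a large weight can push the result past the queue-level
     * AM limit; capping is shown here as one possible mitigation.
     */
    static long displayedUserAmLimitMb(long userAmLimitMb, float userWeight,
                                       long queueAmLimitMb) {
        long weighted = (long) (userAmLimitMb * userWeight);
        return Math.min(weighted, queueAmLimitMb);
    }

    public static void main(String[] args) {
        // Queue AM limit 2048 MB, pre-weighted per-user limit 1024 MB.
        System.out.println(displayedUserAmLimitMb(1024, 1.0f, 2048)); // 1024
        System.out.println(displayedUserAmLimitMb(1024, 3.0f, 2048)); // 2048 (capped)
    }
}
```

With the cap removed, a user with weight 3.0 would display 3072 MB, above the queue-level limit, which is the non-normalized behavior described in the patch note.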
[jira] [Updated] (YARN-7619) Max AM Resource value in CS UI is different for every user
[ https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7619: - Attachment: Max AM Resources is Different for Each User.png
[jira] [Created] (YARN-7619) Max AM Resource value in CS UI is different for every user
Eric Payne created YARN-7619: Summary: Max AM Resource value in CS UI is different for every user Key: YARN-7619 URL: https://issues.apache.org/jira/browse/YARN-7619 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0 Reporter: Eric Payne Assignee: Eric Payne YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity scheduler UI used to contain the queue-level AM limit instead of the user-level AM limit. It fixed this by using the user-specific AM limit that is calculated in {{LeafQueue#activateApplications}}, stored in each user's {{LeafQueue#User}} object, and retrieved via {{UserInfo#getResourceUsageInfo}}. The problem is that this user-specific AM limit depends on the activity of other users and other applications in a queue, and it is only calculated and updated when a user's application is activated. So, when {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale value unless an application was recently activated for a particular user.
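The staleness described in the report can be sketched in miniature. The class below is a simplified stand-in for {{LeafQueue#User}}'s cached limit, not the actual scheduler code; the division by active users is only an illustrative stand-in for the real limit computation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the caching behavior described in the report:
// the per-user AM limit is recomputed only when that user's application
// is activated, so a reader (the UI) can observe a stale value.
public class StaleAmLimitSketch {
    private final Map<String, Long> cachedAmLimitMb = new HashMap<>();
    private final long queueAmLimitMb;

    StaleAmLimitSketch(long queueAmLimitMb) {
        this.queueAmLimitMb = queueAmLimitMb;
    }

    // Recomputed only on activation, mirroring LeafQueue#activateApplications.
    void activateApplication(String user, int activeUsers) {
        cachedAmLimitMb.put(user, queueAmLimitMb / activeUsers);
    }

    // What the UI reads: whatever was cached at the user's last activation.
    long uiAmLimitMb(String user) {
        return cachedAmLimitMb.getOrDefault(user, 0L);
    }
}
```

If user1 activates while alone in the queue and user2 then activates as a second user, the UI keeps showing user1 the single-user limit until user1 activates another application: a different Max AM Resource for every user.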
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276200#comment-16276200 ] Eric Payne commented on YARN-6124: -- Sorry for the delay. Belatedly, I am fine with the patch. > Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin > -refreshQueues > - > > Key: YARN-6124 > URL: https://issues.apache.org/jira/browse/YARN-6124 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Zian Chen > Fix For: 3.1.0 > > Attachments: YARN-6124.4.patch, YARN-6124.5.patch, YARN-6124.6.patch, > YARN-6124.wip.1.patch, YARN-6124.wip.2.patch, YARN-6124.wip.3.patch > > > Now enabled / disable / update SchedulingEditPolicy config requires restart > RM. This is inconvenient when admin wants to make changes to > SchedulingEditPolicies.
[jira] [Commented] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue
[ https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269455#comment-16269455 ] Eric Payne commented on YARN-7575: -- Sorry, my bad. My ULF is set to 2.0 on the default queue. After setting it to 3.0, my use case works. On the plus side, we know that ULF works as expected with absolute capacity :) +1 on the patch. Thanks [~sunilg] > When using absolute capacity configuration with no max capacity, scheduler UI > NPEs and can't grow queue > --- > > Key: YARN-7575 > URL: https://issues.apache.org/jira/browse/YARN-7575 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Eric Payne > Attachments: YARN-7575-YARN-5881.001.patch > > > I encountered the following while reviewing and testing branch YARN-5881. > The design document from YARN-5881 says that for max-capacity: > {quote} > 3) For each queue, we require: > a) if max-resource not set, it automatically set to parent.max-resource > {quote} > When I try leaving blank {{yarn.scheduler.capacity.< > queue-path>.maximum-capacity}}, the RMUI scheduler page refuses to render. It > looks like it's in {{CapacitySchedulerPage$ LeafQueueInfoBlock}}: > {noformat} > 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error > handling URI: /cluster/scheduler > java.lang.reflect.InvocationTargetException > ... > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129) > {noformat} > Also... A job will run in the leaf queue with no max capacity set and it will > grow to the max capacity of the cluster, but if I add resources to the node, > the job won't grow any more even though it has pending resources. 
[jira] [Commented] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue
[ https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269175#comment-16269175 ] Eric Payne commented on YARN-7575: -- [~sunilg], the fix for the UI NPE looks good, but the other problem I'm having is that when I increase a node's size, the queue doesn't grow. My configs are as follows: - 4 node managers, 5120 MB and 10 vcores each, for a total of [memory=20480, vcores=40] - {{yarn.scheduler.capacity.root.default.capacity}}: [memory=10240,vcores=20] - {{yarn.scheduler.capacity.root.eng.capacity}}: [memory=10240,vcores=20] - Note that I do not set root.capacity, nor do I set any maximum-capacity. My use case is as follows: - I start a job requesting 22.5GB and 45 vcores (container size=0.5GB) - The job consumes 20GB and 40 vcores - I add 2.5GB (2560 MB) and 5 vcores to one of the nodes: {{yarn rmadmin -updateNodeResource host:port 7680 15}} - One more container is assigned to the job, but that only brings the job to 20.5GB and 41 vcores.
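The ULF behavior that resolved this thread can be written out as a worked example. The formula below is a simplified approximation of the commonly documented Capacity Scheduler user-limit computation, with illustrative numbers; it is not the exact {{LeafQueue}} code.

```java
// Simplified sketch of the Capacity Scheduler per-user limit, illustrating
// why a user-limit-factor (ULF) of 2.0 capped a 10GB queue's single user at
// 20GB even after resources were added to the cluster. This formula is an
// approximation for illustration, not the exact LeafQueue code.
public class UserLimitSketch {
    static long userLimitMb(long userDemandMb, int activeUsers,
                            long queueCapacityMb, float mulp, float ulf) {
        // Each active user gets at least the minimum-user-limit-percent share.
        long share = Math.max(userDemandMb / activeUsers,
                              (long) (queueCapacityMb * mulp));
        // ULF caps how far a single user may exceed the queue capacity.
        return Math.min(share, (long) (queueCapacityMb * ulf));
    }

    public static void main(String[] args) {
        // 22.5GB (23040 MB) of demand against a 10GB (10240 MB) queue:
        // ULF 2.0 stops the user at 20GB; ULF 3.0 allows the full demand.
        System.out.println(userLimitMb(23040, 1, 10240, 0.1f, 2.0f)); // 20480
        System.out.println(userLimitMb(23040, 1, 10240, 0.1f, 3.0f)); // 23040
    }
}
```

This matches the observation in the thread: with ULF 2.0 the job stalled around 20GB regardless of added node resources, and raising ULF to 3.0 let the use case complete.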
[jira] [Updated] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue
[ https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7575: - Issue Type: Sub-task (was: Bug) Parent: YARN-5881
[jira] [Created] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue
Eric Payne created YARN-7575: Summary: When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue Key: YARN-7575 URL: https://issues.apache.org/jira/browse/YARN-7575 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Reporter: Eric Payne I encountered the following while reviewing and testing branch YARN-5881. The design document from YARN-5881 says that for max-capacity: {quote} 3) For each queue, we require: a) if max-resource not set, it automatically set to parent.max-resource {quote} When I try leaving blank {{yarn.scheduler.capacity.< queue-path>.maximum-capacity}}, the RMUI scheduler page refuses to render. It looks like it's in {{CapacitySchedulerPage$ LeafQueueInfoBlock}}: {noformat} 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error handling URI: /cluster/scheduler java.lang.reflect.InvocationTargetException ... at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164) at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129) {noformat} Also... A job will run in the leaf queue with no max capacity set and it will grow to the max capacity of the cluster, but if I add resources to the node, the job won't grow any more even though it has pending resources.
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265837#comment-16265837 ] Eric Payne commented on YARN-7496: -- Thank you very much, [~leftnoteasy] > CS Intra-queue preemption user-limit calculations are not in line with > LeafQueue user-limit calculations > > > Key: YARN-7496 > URL: https://issues.apache.org/jira/browse/YARN-7496 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.2 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.8.3 > > Attachments: YARN-7496.001.branch-2.8.patch > > > Only a problem in 2.8. > Preemption could oscillate due to the difference in how user limit is > calculated between 2.8 and later releases. > Basically (ignoring ULF, MULP, and maybe others), the calculation for user > limit on the Capacity Scheduler side in 2.8 is {{total used resources / > number of active users}} while the calculation in later releases is {{total > active resources / number of active users}}. When intra-queue preemption was > backported to 2.8, its calculations for user limit were more aligned with > the latter algorithm, which is in 2.9 and later releases.
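The two calculations contrasted in the description above can be written side by side. As in the description, ULF and MULP are ignored here, and the resource values are reduced to plain MB longs for illustration; this is a sketch of the divisors, not the actual scheduler or preemption-monitor code.

```java
// Sketch of the two user-limit styles described in YARN-7496
// (ULF, MULP, and other factors ignored, as in the description).
public class UserLimitDivisorSketch {
    // branch-2.8 LeafQueue style: total *used* resources / active users.
    static long userLimit28(long totalUsedMb, int activeUsers) {
        return totalUsedMb / activeUsers;
    }

    // 2.9+ style (and the backported preemption math): total *active*
    // resources / active users.
    static long userLimit29(long totalActiveMb, int activeUsers) {
        return totalActiveMb / activeUsers;
    }
}
```

When used and active resources diverge, say 8GB used but 20GB active for two users, the 2.8 scheduler computes a 4GB limit while the backported preemption monitor computes 10GB, and the disagreement between the two components is what lets preemption oscillate.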
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261617#comment-16261617 ] Eric Payne commented on YARN-7496: -- {code} hadoop.yarn.server.resourcemanager.TestClientRMTokens hadoop.yarn.server.resourcemanager.TestAMAuthorization hadoop.yarn.server.resourcemanager.TestLeaderElectorService {code} These tests are passing for me in my local environment.
[jira] [Commented] (YARN-7533) Documentation for absolute resource support in CS
[ https://issues.apache.org/jira/browse/YARN-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261565#comment-16261565 ] Eric Payne commented on YARN-7533: -- [~sunilg], this looks good to me. > Documentation for absolute resource support in CS > - > > Key: YARN-7533 > URL: https://issues.apache.org/jira/browse/YARN-7533 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Sunil G >Assignee: Sunil G > Attachments: YARN-7533-YARN-5881.002.patch, YARN-7533.001.patch
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261242#comment-16261242 ] Eric Payne commented on YARN-7496: -- Thanks for your review, [~leftnoteasy]. What's the next step?
[jira] [Commented] (YARN-7533) Documentation for absolute resource support in CS
[ https://issues.apache.org/jira/browse/YARN-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260064#comment-16260064 ] Eric Payne commented on YARN-7533: -- Hi [~sunilg]. Thanks for the patch. I have just a couple of grammatical suggestions: - For {{Resource Allocation}}, I suggest changing it to the following: {code} + * Resource Allocation using Absolute Resources configuration + `CapacityScheduler` supports configuration of absolute resources instead of providing Queue *capacity* in percentage. The following configurations could be used to configure absolute resources. {code} - For {{yarn.scheduler.capacity..min-resource}}, something like: {code} + | `yarn.scheduler.capacity..min-resource` | Absolute resource queue capacity minimum configuration. Default value is empty. [memory=10240,vcores=12] is a valid configuration which indicates 10GB Memory and 12 VCores.| + | `yarn.scheduler.capacity..max-resource` | Absolute resource queue capacity maximum configuration. Default value is empty. [memory=10240,vcores=12] is a valid configuration which indicates 10GB Memory and 12 VCores.| {code}
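Assuming standard capacity-scheduler.xml property syntax, the settings described above would look roughly like this for a hypothetical queue {{root.default}}; the queue name and resource values here are illustrative, not from the patch:

```xml
<!-- Hypothetical example of the absolute-resource settings discussed
     above, for an illustrative queue root.default. -->
<property>
  <name>yarn.scheduler.capacity.root.default.min-resource</name>
  <value>[memory=10240,vcores=12]</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.max-resource</name>
  <value>[memory=20480,vcores=24]</value>
</property>
```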
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259865#comment-16259865 ] Eric Payne commented on YARN-6124: -- bq. AdminService#refreshQueues, conf.size(): I'm not sure why this is needed I see that if this call is not there, it gets the following exception: {noformat} refreshQueues: com.ctc.wstx.exc.WstxIOException: Stream closed {noformat} Still, calling {{conf.size()}} seems awkward. It seems like there should be a better way to do this.
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259540#comment-16259540 ] Eric Payne commented on YARN-6124: -- Thanks [~Zian Chen]. I appreciate the good work here. Sorry for the late reply. I have a couple of comments. - {{AdminService#refreshQueues}}, first comment: {code} // We use getConfig() before which gets a capacity-scheduler.xml reference // when parsing it into CapacityScheduler#reinitialize, but we need to get // properties from yarn-site.xml when we want to enable/disable preemption {code} -- I wouldn't say anything about what it did before or about the capacity scheduler, since this calls into all of the schedulers. Also, I wouldn't specify preemption properties, since the scheduling monitor is pluggable and doesn't have to be for preemption. I would just say something like this: {{// Retrieve yarn-site.xml in order to refresh scheduling monitor properties.}} - {{AdminService#refreshQueues}}, {{conf.size()}}: -- The comment says {{force the Configuration#getProps been called to reload all the properties.}}. I'm not sure why this is needed. I'm pretty sure that when {{SchedulingMonitorManager#updateSchedulingMonitors}} calls the following code, it will also call {{Configuration#getProps}} at that point: {code} boolean monitorsEnabled = conf.getBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS); {code}
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259412#comment-16259412 ] Eric Payne commented on YARN-7496: -- Thanks for looking at this, [~leftnoteasy]
[jira] [Updated] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7496: - Attachment: YARN-7496.001.branch-2.8.patch Attaching a fix for branch-2.8. This change is in {{LeafQueue#computeUserLimit}}. It should only affect preemption user-limit calculations and should not affect assignment user-limit calculations. Since it does touch the computations for user limit, I would really appreciate it if [~sunilg], [~leftnoteasy], or [~jlowe] could take a look at it.
[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255891#comment-16255891 ]

Eric Payne commented on YARN-7469:
----------------------------------

Thank you very much, [~sunilg]. My one concern is that this fix should also go into branch-2.9, since it is also in branch-2.8.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is
> exactly at user limit
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, yarn
> Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
> Reporter: Eric Payne
> Assignee: Eric Payne
> Fix For: 2.8.3, 3.0.0, 3.1.0, 2.10.0
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
>
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those resources to User2
> - User3 submits app3 to Queue1. To begin with, app3 is requesting 1 container for the AM.
> - Preemption monitor never preempts a container.
[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253828#comment-16253828 ]

Eric Payne commented on YARN-7469:
----------------------------------

Thanks, [~sunilg]! That would be great!
[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253721#comment-16253721 ]

Eric Payne commented on YARN-7469:
----------------------------------

bq. now min container is the dead zone here
I filed YARN-7501 to include a "dead zone" around the user limit.

bq. in 2.8, this fix has a problem of oscillation due to the difference in how user limit is calculated between 2.8 and later releases.
[~sunilg], I think this patch should be used to fix the user starvation problem, and the 2.8-specific oscillation problem can be handled by YARN-7496. {{YARN-7469.001.patch}} will apply cleanly to all branches back to branch-2.8.
[jira] [Created] (YARN-7501) Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit
Eric Payne created YARN-7501:
-----------------------------

Summary: Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit
Key: YARN-7501
URL: https://issues.apache.org/jira/browse/YARN-7501
Project: Hadoop YARN
Issue Type: Improvement
Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0
Reporter: Eric Payne
[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
[ https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252643#comment-16252643 ]

Eric Payne commented on YARN-7496:
----------------------------------

Cluster Configuration:
- Cluster Memory: 20GB
- Queue1 capacity and max capacity: 50% : 100%
- Queue2 capacity and max capacity: 50% : 100%
- Queue1: Intra-queue preemption: enabled
- Default container size: 0.5GB

Use Case:
- User1 submits App1 in Queue1 and consumes 12.5GB
- User2 submits App2 in Queue1 and consumes 7.5GB
- User3 submits App3 in Queue1
- The preemption monitor calculates the user limit to be {{((total used resources in Queue1) / (number of all users)) + (1 container) = normalizeup((20GB / 3), 0.5GB) + 0.5GB = 7GB + 0.5GB = 7.5GB}}
- The preemption monitor sees that App1 is the only app that has preemptable resources, so it tries to preempt containers from {{App1}} down to 7.5GB.
- The problem comes here: the Capacity Scheduler calculates the user limit to be {{((total used resources in Queue1) / (number of active users)) + (1 container) = normalizeup((20GB / 2), 0.5GB) + 0.5GB = 10GB + 0.5GB = 10.5GB}}
- Therefore, once {{App1}} gets down to 10.5GB, the preemption monitor will try to preempt 2.5GB more resources from {{App1}}, but the Capacity Scheduler gives them back. This creates oscillation.
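The oscillation arithmetic above can be sketched as follows. This is illustrative Python, not Hadoop code; {{normalizeup}} here is a stand-in for the scheduler's resource normalization, and the numbers come from the use case above.

```python
import math

CONTAINER = 0.5  # GB, the default container size from the use case


def normalizeup(value, step=CONTAINER):
    """Round a resource value up to the next multiple of the container size."""
    return math.ceil(value / step) * step


queue_used = 20.0  # GB used in Queue1
all_users = 3      # User1, User2, User3: the preemption monitor's view
active_users = 2   # only User1 and User2 are active: the scheduler's view

# Preemption monitor (2.8 backport): total used / all users + 1 container
monitor_limit = normalizeup(queue_used / all_users) + CONTAINER      # 7.5 GB

# Capacity Scheduler (2.8): total used / active users + 1 container
scheduler_limit = normalizeup(queue_used / active_users) + CONTAINER  # 10.5 GB

# The monitor keeps preempting App1 down toward monitor_limit while the
# scheduler keeps handing the difference back: the oscillation described above.
```

The two limits disagree only because the monitor counts all three users while the scheduler counts the two active ones; with equal denominators the oscillation disappears.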
[jira] [Created] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
Eric Payne created YARN-7496:
-----------------------------

Summary: CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations
Key: YARN-7496
URL: https://issues.apache.org/jira/browse/YARN-7496
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne

Only a problem in 2.8.

Preemption could oscillate due to the difference in how user limit is calculated between 2.8 and later releases.

Basically (ignoring ULF, MULP, and maybe others), the calculation for user limit on the Capacity Scheduler side in 2.8 is {{total used resources / number of active users}}, while the calculation in later releases is {{total active resources / number of active users}}. When intra-queue preemption was backported to 2.8, its user-limit calculations were more aligned with the latter algorithm, which is in 2.9 and later releases.
[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250237#comment-16250237 ]

Eric Payne commented on YARN-7469:
----------------------------------

bq. In broader perspective, i think we are lacking dead zone here. In a way, now min container is the dead zone here. But if user gets more control on this, may be more oscillations could be avoided. May be we can take up that also in another ticket.
[~sunilg], thanks for looking at the patch. Yes, I agree that a dead zone above the user limit would be a very helpful feature to add.
[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne updated YARN-7469:
-----------------------------
    Attachment: YARN-7469.001.patch

Attaching a proposal for a patch to fix this problem.

Proposed fix: In {{calculateToBePreemptedResourcePerApp}}, if the {{USERLIMIT_FIRST}} policy is set, subtract off the minimum container size. Basically, the code in {{skipContainerBasedOnIntraQueuePolicy}} skips the container if preempting it would bring the app down to the user limit, because the capacity scheduler assigns one container more than the user limit.

Also, in 2.8, this fix has a problem of oscillation due to the difference in how user limit is calculated between 2.8 and later releases. Basically (ignoring ULF, MULP, and maybe others), the calculation in 2.8 is {{total used resources / number of active users}}, while the calculation in later releases is {{total active resources / number of active users}}. With this fix in 2.8, the value of {{getResourceLimitForAllUsers}} (used by the preemption monitor) would be greater than the {{getHeadroom}} value used by {{LeafQueue}}, which would cause more preemption to occur than necessary. Bottom line: I'm still working on a 2.8 solution.
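A minimal sketch of the proposed adjustment. This is hypothetical Python, not the actual patch; the function name, signature, and {{MIN_CONTAINER}} value are assumptions for illustration only.

```python
MIN_CONTAINER = 0.5  # GB, minimum container size (assumed from the use case)


def to_be_preempted(used, user_limit, am_size, userlimit_first=True):
    """Per-app preemptable amount, mirroring the description above."""
    preemptable = used - (user_limit - am_size)
    if userlimit_first:
        # Proposed fix: under USERLIMIT_FIRST, subtract one minimum
        # container, since the scheduler assigns one container more than
        # the user limit; this keeps the skip check from starving an app.
        preemptable -= MIN_CONTAINER
    return max(preemptable, 0.0)


# With the fix, one minimum container less is targeted for preemption:
without_fix = to_be_preempted(10.0, 7.5, 0.5, userlimit_first=False)  # 3.0
with_fix = to_be_preempted(10.0, 7.5, 0.5)                            # 2.5
```

The point of the subtraction is to align the monitor's target with what the container-skipping check will actually allow, so the two stop disagreeing by exactly one container.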
[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247790#comment-16247790 ]

Eric Payne commented on YARN-7469:
----------------------------------

When a queue is in the state described above, {{FifoIntraQueuePreemptionPlugin#calculateToBePreemptedResourcePerApp}} decides (erroneously, I believe) that {{app2}} has preemptable resources. Since {{app2}} is the youngest app with apparent resources, {{FifoIntraQueuePreemptionPlugin#preemptFromLeastStarvedApp}} selects a container to preempt from {{app2}}. However, when it calls {{FifoIntraQueuePreemptionPlugin#skipContainerBasedOnIntraQueuePolicy}}, it decides that preempting the selected container would bring the app's usage down too far below the user limit, so it skips the container. It then never goes on to the next-youngest app with resources.

The logic breaks down to basically this:
{code}
calculateToBePreemptedResourcePerApp {
  // preemptableFromApp will be used to select containers to preempt.
  preemptableFromApp = used - (userlimit - AmSize)
}

skipContainerBasedOnIntraQueuePolicy {
  if ((used - selectedContainerSize) <= (userlimit + AmSize)) {
    // Skip this container.
  }
}
{code}
We get into this starvation mode when {{selectedContainerSize}} ends up being the same size as {{preemptableFromApp}}.
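The broken interaction can be rendered as runnable Python. The helpers below only mirror the pseudocode above, they are not the YARN methods, and the numbers are illustrative.

```python
def preemptable_from_app(used, user_limit, am_size):
    # calculateToBePreemptedResourcePerApp: how much this app may give up.
    return used - (user_limit - am_size)


def skip_container(used, selected, user_limit, am_size):
    # skipContainerBasedOnIntraQueuePolicy: skip any container whose removal
    # would drop the app to (or below) user limit plus AM size.
    return (used - selected) <= (user_limit + am_size)


# Illustrative numbers (GB): an app whose only candidate container is
# exactly the preemptable amount, the condition named above.
used, user_limit, am_size = 7.5, 7.0, 0.5
selected = preemptable_from_app(used, user_limit, am_size)  # 1.0

# The app is marked preemptable, yet its candidate container is skipped,
# and no other app is tried next, so nothing is ever freed.
starved = selected > 0 and skip_container(used, selected, user_limit, am_size)
```

Any {{used}} value that makes {{selected}} equal the container size chosen for preemption reproduces the starvation; the assertion-style flag makes that condition explicit.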
[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne updated YARN-7469:
-----------------------------
    Attachment: UnitTestToShowStarvedUser.patch

Uploading a unit test that demonstrates this.
[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
[ https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne updated YARN-7469:
-----------------------------
    Description:

Queue Configuration:
- Total Memory: 20GB
- 2 Queues
-- Queue1
--- Memory: 10GB
--- MULP: 10%
--- ULF: 2.0
- Minimum Container Size: 0.5GB

Use Case:
- User1 submits app1 to Queue1 and consumes 20GB
- User2 submits app2 to Queue1 and requests 7.5GB
- Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those resources to User2
- User3 submits app3 to Queue1. To begin with, app3 is requesting 1 container for the AM.
- Preemption monitor never preempts a container.
[jira] [Created] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
Eric Payne created YARN-7469:
-----------------------------

Summary: Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit
Key: YARN-7469
URL: https://issues.apache.org/jira/browse/YARN-7469
Project: Hadoop YARN
Issue Type: Bug
Components: capacity scheduler, yarn
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne
[jira] [Commented] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
[ https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237768#comment-16237768 ]

Eric Payne commented on YARN-7424:
----------------------------------

Thanks for the review, [~sunilg].

bq. "yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.minimum-threshold" could be configured to start intra queue preemption on a queue. But yes, this is generic in all queues.
IIUC, {{minimum-threshold}} will prevent intra-queue preemption from acting within a queue until the queue's used resources are above {{minimum-threshold * capacity}}, which is not really helpful here.

bq. max-allowable-limit helps only to control preemption in a given round of preemption calculation. This could be configured to a very low value so that only few resource will be preempted in such cases.
Yes, and we could try to reduce this value even more, which could potentially be helpful. I am just surprised that reducing this from 20% to 3% did not have nearly as much effect as I expected.

bq. Now the solution which you mentioned will help to control preemption.
Actually, after thinking about it more, the proposed solution is not very useful. Here's why:
- Queue1 is configured with 1% MULP
- User1 submits app1 to queue1 and consumes 100% of the resources
- User2 submits app2 to queue1 and requests resources
- The preemption monitor preempts resources from app1 and the capacity scheduler gives them to app2 until app2 is at 1%
- User3 submits app3 to queue1 and requests resources
- The preemption monitor preempts resources from app1, but the capacity scheduler doesn't give them to app3. It gives them to app2, because the user-limit resource value is 33%, app2 came before app3, and user2 is below 33%.
- So, with the proposed solution, user3 keeps asking for resources, the preemption monitor keeps taking them from app1, and the capacity scheduler keeps giving them to app2 until user2 is above 33%.

If you multiply this out to 60 users all asking for resources in a queue with 1% MULP, it is doing pretty much the exact same amount of preempting and balancing as before. In order to create the "desired" behavior, we would have to fundamentally change the way the capacity scheduler works, which we don't want to do.

> Capacity Scheduler Intra-queue preemption: add property to only preempt up to
> configured MULP
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacity scheduler, scheduler preemption
> Affects Versions: 3.0.0-beta1, 2.8.2
> Reporter: Eric Payne
> Assignee: Eric Payne
> Priority: Major
>
> If the queue's configured minimum user limit percent (MULP) is something
> small like 1%, all users will max out well over their MULP until 100 users
> have apps in the queue. Since the intra-queue preemption monitor tries to
> balance the resource among the users, most of the time in this use case it
> will be preempting containers on behalf of users that are already over their
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be
> configured to only preempt on behalf of a user until that user has reached
> its MULP.
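The balancing behavior walked through in the comment above can be modeled with a toy loop. All numbers and the simplified FIFO rule are assumptions for illustration, not scheduler internals.

```python
def first_below_limit(usage, fifo_order, limit):
    """Simplified scheduler rule: the oldest app whose user is still
    below the user limit receives the freed resources."""
    for app in fifo_order:
        if usage[app] < limit:
            return app


usage = {"app1": 99.0, "app2": 1.0, "app3": 0.0}  # percent of queue used
user_limit = 33.0                                 # ~100% / 3 active users
fifo_order = ["app2", "app3"]                     # app1 is the preemptee

preemptions_before_app3_served = 0
while usage["app3"] == 0.0:
    usage["app1"] -= 1.0                                       # monitor preempts 1%
    target = first_below_limit(usage, fifo_order, user_limit)  # scheduler reassigns
    usage[target] += 1.0
    if target != "app3":
        preemptions_before_app3_served += 1
```

Under these assumptions, 32 rounds of preemption land on app2 before app3 receives anything, which is the "same amount of preempting and balancing as before" that the comment describes.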
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236151#comment-16236151 ]

Eric Payne commented on YARN-7370:
----------------------------------

Findbugs warnings are for {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.DominantResourceFairnessPolicy}} and are not related to this patch.

{{CapacitySchedulerConfiguration.java}} has an unused import for {{ImmutableMap}}. I'll go ahead and remove it as part of the commit.

> Preemption properties should be refreshable
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, scheduler preemption
> Affects Versions: 2.8.0, 3.0.0-alpha3
> Reporter: Eric Payne
> Assignee: Gergely Novák
> Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, YARN-7370.003.patch, YARN-7370.004.patch, YARN-7370.005.patch
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}}
> should be refreshable. It would also be nice to make
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}}
> refreshable.
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236089#comment-16236089 ]

Eric Payne commented on YARN-7370:
----------------------------------

Waiting for precommit to complete: https://builds.apache.org/job/PreCommit-YARN-Build/18307/
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235862#comment-16235862 ]

Eric Payne commented on YARN-7370:
----------------------------------

[~sunilg], thanks for your comment. I see what you mean. If we take out the {{if (this.csConfig != null)}}, then the values will be logged during initialization as well as during refresh. That way we can compare newly logged values with the initial ones.

[~GergelyNovak], sorry for changing it at this point, but [~sunilg] has a good point. Would you mind making this change?
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235817#comment-16235817 ]

Eric Payne commented on YARN-7370:
----------------------------------

[~GergelyNovak], thanks for your effort on this feature. Patch LGTM. +1
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235747#comment-16235747 ]

Eric Payne commented on YARN-7370:
----------------------------------

[~subru] and [~asuresh], I am aware that you have requested that things should not be checked into branch-2.9. However, we need this feature to go into branch-2.8, and I think it will be awkward if this feature is in 2.8, branch-2, branch-3, and trunk, but not branch-2.9. Would it be appropriate to commit this to branch-2.9 as well?
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235740#comment-16235740 ]

Eric Payne commented on YARN-7370:
----------------------------------

Thanks for the patch, [~GergelyNovak]. I will review today.

bq. The reason I changed the default constants is the precision problem of the automatic conversion: 0.1 becomes 0.1000149011612 and it looked funny in the newly introduced log messages
Fair enough.
[jira] [Assigned] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
[ https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne reassigned YARN-7424:
--------------------------------
    Assignee: Eric Payne
[jira] [Commented] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
[ https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227528#comment-16227528 ] Eric Payne commented on YARN-7424: -- In a large, multi-tenant queue with MULP of 1%, after instrumenting intra-queue preemption, we have discovered that enabling both inter-queue and intra-queue preemption causes an order of magnitude more lost work than enabling inter-queue preemption alone. Even after reducing {{intra-queue-preemption.max-allowable-limit}} from 20% (default) to 3%, the lost work is still several times more than with inter-queue preemption alone.
| | *MemSeconds Lost* |
| *Only inter-queue preemption enabled* | {{LostCrossQueueMemSec}} |
| *Both inter- and intra-queue preemption enabled with 20% max-allowable-limit* | {{12.7824 * LostCrossQueueMemSec}} |
| *Both inter- and intra-queue preemption enabled with 3% max-allowable-limit* | {{7.9893 * LostCrossQueueMemSec}} |
| | *Vcoreseconds Lost* |
| *Only inter-queue preemption enabled* | {{LostCrossQueueVSec}} |
| *Both inter- and intra-queue preemption enabled with 20% max-allowable-limit* | {{26.1885 * LostCrossQueueVSec}} |
| *Both inter- and intra-queue preemption enabled with 3% max-allowable-limit* | {{19.2676 * LostCrossQueueVSec}} |
It is expected that turning on intra-queue preemption would increase the number of preemptions. However, an order of magnitude more seems excessive. Also, reducing {{intra-queue-preemption.max-allowable-limit}} didn't have nearly the effect I thought it should. I think there is an underlying design philosophy that should be addressed. The current intra-queue preemption design balances the user limit among all of the users. This calculation is based on the total queue capacity and the number of users in the queue. In a very large queue with a large number of active users, the number of users in the queue is constantly changing.
Also, if the node overcommit feature is enabled, the total size of the queue will change as well when the cluster becomes very busy. The result is that preemption must constantly happen in order to balance all of the users. For this reason, we need a configuration property that stops preempting on behalf of a user once the user is above the MULP, which is a stable value. As a variation, we may want to have a "live zone" of MULP plus some configurable value. > Capacity Scheduler Intra-queue preemption: add property to only preempt up to > configured MULP > - > > Key: YARN-7424 > URL: https://issues.apache.org/jira/browse/YARN-7424 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, scheduler preemption >Affects Versions: 3.0.0-beta1, 2.8.2 >Reporter: Eric Payne > > If the queue's configured minimum user limit percent (MULP) is something > small like 1%, all users will max out well over their MULP until 100 users > have apps in the queue. Since the intra-queue preemption monitor tries to > balance the resource among the users, most of the time in this use case it > will be preempting containers on behalf of users that are already over their > MULP guarantee. > This JIRA proposes that a property should be provided so that a queue can be > configured to only preempt on behalf of a user until that user has reached > its MULP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
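The instability described above can be modeled with a simplified (hypothetical) version of the user-limit calculation; {{balancedUserLimit}} is an illustrative helper, not the actual {{LeafQueue}} logic:

```java
public class UserLimitSketch {
    // Simplified model: the balanced user limit is the larger of an even split
    // among active users and the configured minimum user limit percent (MULP).
    static int balancedUserLimit(int queueCapacity, int activeUsers, int mulpPercent) {
        int evenSplit = queueCapacity / activeUsers;
        int mulpShare = queueCapacity * mulpPercent / 100;
        return Math.max(evenSplit, mulpShare);
    }

    public static void main(String[] args) {
        int capacity = 1000; // arbitrary resource units
        int mulp = 1;        // 1% MULP, as in the use case above
        // With few active users the balanced limit sits far above the MULP
        // guarantee and moves every time a user joins or leaves, so balancing
        // preemption keeps firing for users already over their MULP.
        System.out.println(balancedUserLimit(capacity, 10, mulp));  // 100
        System.out.println(balancedUserLimit(capacity, 50, mulp));  // 20
        // Only at 100+ active users does the stable MULP floor take over.
        System.out.println(balancedUserLimit(capacity, 200, mulp)); // 10
    }
}
```

Capping preemption at the MULP share (the second term) would make the preemption target a stable value regardless of user churn.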
[jira] [Updated] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
[ https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7424: - Description: If the queue's configured minimum user limit percent (MULP) is something small like 1%, all users will max out well over their MULP until 100 users have apps in the queue. Since the intra-queue preemption monitor tries to balance the resource among the users, most of the time in this use case it will be preempting containers on behalf of users that are already over their MULP guarantee. This JIRA proposes that a property should be provided so that a queue can be configured to only preempt on behalf of a user until that user has reached its MULP. was: > Capacity Scheduler Intra-queue preemption: add property to only preempt up to > configured MULP > - > > Key: YARN-7424 > URL: https://issues.apache.org/jira/browse/YARN-7424 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler, scheduler preemption >Affects Versions: 3.0.0-beta1, 2.8.2 >Reporter: Eric Payne > > If the queue's configured minimum user limit percent (MULP) is something > small like 1%, all users will max out well over their MULP until 100 users > have apps in the queue. Since the intra-queue preemption monitor tries to > balance the resource among the users, most of the time in this use case it > will be preempting containers on behalf of users that are already over their > MULP guarantee. > This JIRA proposes that a property should be provided so that a queue can be > configured to only preempt on behalf of a user until that user has reached > its MULP. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP
Eric Payne created YARN-7424: Summary: Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP Key: YARN-7424 URL: https://issues.apache.org/jira/browse/YARN-7424 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-beta1, 2.8.2 Reporter: Eric Payne -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227276#comment-16227276 ] Eric Payne commented on YARN-7370: -- [~GergelyNovak], Thanks for the updated patch. Just a couple of things: - Why were {{DEFAULT_PREEMPTION_MAX_IGNORED_OVER_CAPACITY}} and {{DEFAULT_PREEMPTION_NATURAL_TERMINATION_FACTOR}} changed from float to double? The capacity scheduler configuration properties are not consistent about the usage of float and double, but it looks like the preemption properties are using float. If we want to make it consistent or change these to double, I would prefer to do it as a separate JIRA. - Thanks for adding the log documenting the updated properties. Can you please add the following properties to the log statement? -- isIntraQueuePreemptionEnabled -- selectCandidatesForResevedContainers -- isQueuePriorityPreemptionEnabled -- additionalPreemptionBasedOnReservedResource > Preemption properties should be refreshable > --- > > Key: YARN-7370 > URL: https://issues.apache.org/jira/browse/YARN-7370 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.8.0, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Gergely Novák > Attachments: YARN-7370.001.patch, YARN-7370.002.patch > > > At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} > should be refreshable. It would also be nice to make > {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} > refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222830#comment-16222830 ] Eric Payne commented on YARN-7370: -- Thanks [~GergelyNovak] for the work on this patch. I just have a couple of small issues with the patch and one suggestion. - {{ProportionalCapacityPreemptionPolicy}} has an unused import of {{YarnConfiguration}} - In {{ProportionalCapacityPreemptionPolicy#updateConfigIfNeeded}}, can we switch the names of the local {{csConfig}} variable and the global class instance variable {{config}}? My opinion is that a class instance variable should have the more descriptive name. - It would be nice if {{updateConfigIfNeeded}} would LOG the values of all of the properties so that we have a record in the RM syslog whenever the values are refreshed. > Preemption properties should be refreshable > --- > > Key: YARN-7370 > URL: https://issues.apache.org/jira/browse/YARN-7370 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.8.0, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Gergely Novák > Attachments: YARN-7370.001.patch > > > At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} > should be refreshable. It would also be nice to make > {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} > refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221064#comment-16221064 ] Eric Payne commented on YARN-6124: -- Thanks [~leftnoteasy]. I will document my findings and you can work on it when you get to it. YARN-7370 doesn't depend on this JIRA, does it? I got it to move past the NPE, but the changes I made may not be the best (it may have other side effects): {code} public void serviceInit(Configuration conf) throws Exception { Configuration configuration = new Configuration(conf); -super.serviceInit(conf); initScheduler(configuration); +super.serviceInit(conf); } {code} Also, a quick test didn't seem to work. I started the RM with {{yarn.resourcemanager.scheduler.monitor.enable}} set to {{true}}, changed it to false, and then did {{-refreshQueues}}. It's going through the {{updateSchedulingMonitors}} code but it doesn't change the value. > Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin > -refreshQueues > - > > Key: YARN-6124 > URL: https://issues.apache.org/jira/browse/YARN-6124 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch > > > Now enabled / disable / update SchedulingEditPolicy config requires restart > RM. This is inconvenient when admin wants to make changes to > SchedulingEditPolicies. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
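The NPE discussed here is an init-ordering hazard: the monitor comes up before the scheduler's configuration exists. A stripped-down sketch with hypothetical classes (not the real {{CapacityScheduler}} or {{ProportionalCapacityPreemptionPolicy}}):

```java
public class InitOrdering {
    static class Scheduler {
        private String config; // populated only by initScheduler()

        void initScheduler() {
            config = "capacity-scheduler.xml";
        }

        // Stands in for a preemption policy's init() reading
        // scheduler.getConfiguration(); dereferencing a null config NPEs.
        void startMonitor() {
            System.out.println("monitor initialized with " + config.length() + "-char config");
        }
    }

    public static void main(String[] args) {
        Scheduler s = new Scheduler();
        try {
            // Original ordering: super.serviceInit() starts the monitor
            // before initScheduler() has run.
            s.startMonitor();
        } catch (NullPointerException e) {
            System.out.println("NPE: configuration not yet initialized");
        }
        // Reordered as in the {code} snippet above: initScheduler() first.
        s.initScheduler();
        s.startMonitor();
    }
}
```

This also illustrates the "other side effects" worry: reordering fixes this reader, but anything in {{super.serviceInit}} that expected to run first now runs second.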
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220844#comment-16220844 ] Eric Payne commented on YARN-6124: -- Thanks [~leftnoteasy]. The proof of concept looks good, but in this version the {{ProportionalCapacityPreemptionPolicy}} is NPE-ing during {{init}} because {{scheduler.getConfiguration()}} is returning null. > Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin > -refreshQueues > - > > Key: YARN-6124 > URL: https://issues.apache.org/jira/browse/YARN-6124 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch > > > Now enabled / disable / update SchedulingEditPolicy config requires restart > RM. This is inconvenient when admin wants to make changes to > SchedulingEditPolicies. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7370) Preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218617#comment-16218617 ] Eric Payne commented on YARN-7370: -- Thanks [~leftnoteasy] for the further design specifications. bq. YARN-6142, we will take care of all scheduling edit policy refresh. YARN-6142 is closed, so I'm not sure where the actual work will take place. As for the rest, it sounds like a good plan. > Preemption properties should be refreshable > --- > > Key: YARN-7370 > URL: https://issues.apache.org/jira/browse/YARN-7370 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.8.0, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Gergely Novák > > At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} > should be refreshable. It would also be nice to make > {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} > refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues
[ https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216867#comment-16216867 ] Eric Payne commented on YARN-6124: -- Yes, I agree that these should be part of the scheduler. That makes a lot of sense. > Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin > -refreshQueues > - > > Key: YARN-6124 > URL: https://issues.apache.org/jira/browse/YARN-6124 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6124.wip.1.patch > > > Now enabled / disable / update SchedulingEditPolicy config requires restart > RM. This is inconvenient when admin wants to make changes to > SchedulingEditPolicies. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7370) Intra-queue preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215525#comment-16215525 ] Eric Payne commented on YARN-7370: -- Thanks [~leftnoteasy], [~sunilg], and [~GergelyNovak]. So, just to be clear, I think we would all like the following preemption properties to be refreshable with {{yarn rmadmin -refreshQueues}}: {noformat} yarn.resourcemanager.scheduler.monitor.enable yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor yarn.resourcemanager.monitor.capacity.preemption.observe_only yarn.resourcemanager.monitor.capacity.preemption.select_based_on_reserved_containers yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round yarn.scheduler.capacity.lazy-preemption-enabled # Intra-queue-specific properties: yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.minimum-threshold yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.max-allowable-limit yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.preemption-order-policy {noformat} I do NOT think we want to refresh {{yarn.resourcemanager.scheduler.monitor.policies}} since that would require stopping and restarting the monitor thread. At least, if we want to make this refreshable, I suggest that we do it as part of a separate JIRA. Also, just FYI, the {{yarn.scheduler.capacity.root.\[QUEUEPATH\].disable_preemption}} property is already refreshable. 
> Intra-queue preemption properties should be refreshable > --- > > Key: YARN-7370 > URL: https://issues.apache.org/jira/browse/YARN-7370 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.8.0, 3.0.0-alpha3 >Reporter: Eric Payne > > At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} > should be refreshable. It would also be nice to make > {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} > refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
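The refresh model implied by the property list above -- reload the preemption settings on each monitoring pass instead of once at startup -- could look roughly like this; {{PreemptionPolicy}} and its fields here are an illustrative sketch, not the actual Hadoop classes:

```java
import java.util.HashMap;
import java.util.Map;

public class RefreshSketch {
    static class PreemptionPolicy {
        private long lastLoaded = -1;
        private double maxAllowableLimit;

        // Called at the top of every monitoring interval; reloads the
        // properties only when the configuration source has changed.
        void updateConfigIfNeeded(Map<String, String> conf, long version) {
            if (version == lastLoaded) {
                return;
            }
            maxAllowableLimit = Double.parseDouble(
                conf.getOrDefault("intra-queue-preemption.max-allowable-limit", "0.2"));
            lastLoaded = version;
            // Log the refreshed values so the RM syslog records each refresh.
            System.out.println("refreshed max-allowable-limit=" + maxAllowableLimit);
        }

        double getMaxAllowableLimit() { return maxAllowableLimit; }
    }

    public static void main(String[] args) {
        PreemptionPolicy policy = new PreemptionPolicy();
        Map<String, String> conf = new HashMap<>();
        policy.updateConfigIfNeeded(conf, 1); // default: 0.2
        conf.put("intra-queue-preemption.max-allowable-limit", "0.03");
        policy.updateConfigIfNeeded(conf, 2); // refreshed: 0.03
        System.out.println(policy.getMaxAllowableLimit());
    }
}
```

Because the reload happens inside the already-running monitor thread, none of the listed properties require restarting that thread -- unlike refreshing the monitor policies list itself.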
[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215426#comment-16215426 ] Eric Payne commented on YARN-4163: -- Thanks [~jlowe] and [~lichangleo]. I will commit this to trunk, branch-3.0, branch-2, and branch-2.8. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, > YARN-4163.007.branch-2.8.patch, YARN-4163.007.patch, YARN-4163.2.patch, > YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.007.branch-2.8.patch Attach branch-2.8 specific patch > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, > YARN-4163.007.branch-2.8.patch, YARN-4163.007.patch, YARN-4163.2.patch, > YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7370) Intra-queue preemption properties should be refreshable
[ https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212646#comment-16212646 ] Eric Payne commented on YARN-7370: -- [~GergelyNovak], thank you for your interest. Please go ahead and take this JIRA. bq. 2) Do you mean to add a new rmadmin command like -refreshSchedulingMonitors or make this part of -refreshQueues? My opinion is to include these as part of the {{-refreshQueues}} option. The queue-specific disable preemption option is refreshable under {{-refreshQueues}}, so I think it makes sense to refresh the others in the same way. > Intra-queue preemption properties should be refreshable > --- > > Key: YARN-7370 > URL: https://issues.apache.org/jira/browse/YARN-7370 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption >Affects Versions: 2.8.0, 3.0.0-alpha3 >Reporter: Eric Payne > > At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} > should be refreshable. It would also be nice to make > {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} > refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7370) Intra-queue preemption properties should be refreshable
Eric Payne created YARN-7370: Summary: Intra-queue preemption properties should be refreshable Key: YARN-7370 URL: https://issues.apache.org/jira/browse/YARN-7370 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, scheduler preemption Affects Versions: 3.0.0-alpha3, 2.8.0 Reporter: Eric Payne At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} should be refreshable. It would also be nice to make {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} refreshable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209494#comment-16209494 ] Eric Payne commented on YARN-4163: -- The unit tests are passing for me in my local environment. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.007.patch, > YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.007.patch Thanks [~jlowe] for the review. I have made some "generic" APIs for successful and failure logs that take the common set of arguments plus an {{ArgsBuilder}} that contains the operation-specific arguments. These generic APIs could be used to replace the existing success and failure log methods. I suggest that a separate JIRA be created for that. I have a separate branch-2.8 patch that I will upload once the pre-commit build completes for this patch. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.007.patch, > YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
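The generic success API plus {{ArgsBuilder}} described above might look roughly like the following; the names and log format are illustrative, not the actual {{RMAuditLogger}}:

```java
public class AuditSketch {
    // Collects operation-specific key=value pairs to append after the
    // common audit fields.
    static class ArgsBuilder {
        private final StringBuilder sb = new StringBuilder();

        ArgsBuilder append(String key, String value) {
            if (sb.length() > 0) {
                sb.append('\t');
            }
            sb.append(key).append('=').append(value);
            return this;
        }

        @Override
        public String toString() { return sb.toString(); }
    }

    // Generic success entry: common fields first, then the builder's extras.
    static String logSuccess(String user, String operation, ArgsBuilder args) {
        return "USER=" + user + "\tOPERATION=" + operation + "\tRESULT=SUCCESS\t" + args;
    }

    public static void main(String[] args) {
        String line = logSuccess("ericp", "getQueueInfo",
            new ArgsBuilder().append("QUEUENAME", "root.default")
                             .append("INCLUDEAPPS", "true"));
        System.out.println(line);
    }
}
```

One generic success method and one failure method can then serve every operation, with the builder carrying whatever extra arguments each call site needs.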
[jira] [Commented] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193591#comment-16193591 ] Eric Payne commented on YARN-7245: -- Thanks [~sunilg]. Looking forward to your response. > In Cap Sched UI, Max AM Resource column in Active Users Info section should > be per-user > --- > > Key: YARN-7245 > URL: https://issues.apache.org/jira/browse/YARN-7245 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: CapSched UI Showing Inaccurate Per User Max AM > Resource.png, Max AM Resource Per User -- Fixed.png, YARN-7245.001.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.006.branch-2.8.patch Attaching branch-2.8 patch > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.2.patch, > YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.006.patch Attaching {{YARN-4163.006.patch}} to address checkstyle issues. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.006.patch, YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, > YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193102#comment-16193102 ] Eric Payne commented on YARN-4163: -- OK, so there are still some valid checkstyle warnings. Interestingly, when I ran testpatch locally, none of these showed up. The only ones I won't be fixing are those complaining about too many args in the method signature. In order to fix this, I would have to refactor the methods. Also, this patch applies cleanly to trunk, branch-3, and branch-2, but not branch-2.8. I will upload a separate branch-2.8 patch. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.005.patch [~jlowe], attaching patch {{YARN-4163.005.patch}}. This contains an ArgsBuilder class, as suggested, and fixes for javadocs warnings. I fixed some of the checkstyle warnings, but others I did not fix due to other considerations. I will comment further once the pre-commit build comes back with the current warnings. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.005.patch, > YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spike of load but > not able to confirm due to they are not audit logged. This patch propose to > add them to audit log -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7245: - Attachment: Max AM Resource Per User -- Fixed.png I attached {{YARN-7245.001.patch}} to address this. I also attached a screenshot to show that the value {{Max AM Resource}} column matches the value in the {{Max Application Master Resources Per User}} field. > In Cap Sched UI, Max AM Resource column in Active Users Info section should > be per-user > --- > > Key: YARN-7245 > URL: https://issues.apache.org/jira/browse/YARN-7245 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: CapSched UI Showing Inaccurate Per User Max AM > Resource.png, Max AM Resource Per User -- Fixed.png, YARN-7245.001.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7245: - Attachment: YARN-7245.001.patch > In Cap Sched UI, Max AM Resource column in Active Users Info section should > be per-user > --- > > Key: YARN-7245 > URL: https://issues.apache.org/jira/browse/YARN-7245 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4 >Reporter: Eric Payne >Assignee: Eric Payne > Attachments: CapSched UI Showing Inaccurate Per User Max AM > Resource.png, YARN-7245.001.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7271) Add a yarn application cost calculation framework in TimelineService v2
[ https://issues.apache.org/jira/browse/YARN-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188450#comment-16188450 ] Eric Payne commented on YARN-7271: -- [~vrushalic], The RM has a built-in calculation that keeps track of memory and vcore usage. I'm linking YARN-415 to see if it meets your needs. > Add a yarn application cost calculation framework in TimelineService v2 > --- > > Key: YARN-7271 > URL: https://issues.apache.org/jira/browse/YARN-7271 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineclient, timelinereader, timelineserver >Reporter: Vrushali C > > Timeline Service v2 captures information about a YARN application. From this > info, we would like to calculate the "cost" of a YARN application. This > would be rolled up to the flow level as well (and to the user and queue levels > eventually). > We need a way to accept machine cost (TCO per day) and enable this > calculation. This will help with chargeback for YARN apps.
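The chargeback idea above can be sketched as a resource-seconds share of machine cost. This is only an illustration of the proposal, not the TimelineService v2 design; the function name and the memory-seconds metric are assumptions.

```python
# Hypothetical chargeback sketch: attribute a share of a machine's daily TCO
# to an application, proportional to the memory-seconds it consumed out of
# the cluster's total memory-seconds for that day.
def app_cost(app_mb_seconds, cluster_mb, tco_per_day):
    seconds_per_day = 86400
    cluster_mb_seconds = cluster_mb * seconds_per_day
    return tco_per_day * app_mb_seconds / cluster_mb_seconds
```

For example, an app that held 1024 MB for a full day on a 10240 MB cluster would be charged one tenth of the daily TCO. Rolling this up to flow, user, and queue levels would just sum the per-app costs.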
[jira] [Commented] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188018#comment-16188018 ] Eric Payne commented on YARN-7245: -- bq. This is bad. We ideally need user based max-am-limit. [~sunilg], the value for {{Max Application Master Resources Per User}} exists and is used by the scheduler. However, the per-user section under {{Active Users Info}} displays the value for the whole queue instead of per user. This is a problem in the GUI only.
[jira] [Assigned] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne reassigned YARN-7245: Assignee: Eric Payne
[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-4163: - Attachment: YARN-4163.004.patch I'm uploading YARN-4163.004.patch to upmerge the patch to trunk. Also, this patch addresses [~jlowe]'s review comments. > Audit getQueueInfo and getApplications calls > > > Key: YARN-4163 > URL: https://issues.apache.org/jira/browse/YARN-4163 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Chang Li >Assignee: Chang Li > Attachments: YARN-4163.004.patch, YARN-4163.2.patch, > YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch > > > getQueueInfo and getApplications seem to sometimes cause spikes of load, but > we have not been able to confirm this because the calls are not audit logged. This patch proposes to > add them to the audit log
[jira] [Commented] (YARN-7084) TestSchedulingMonitor#testRMStarts fails sporadically
[ https://issues.apache.org/jira/browse/YARN-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185991#comment-16185991 ] Eric Payne commented on YARN-7084: -- Thanks [~jlowe] for reporting the issue and the fix. The strategy and fix LGTM +1 Will commit soon. > TestSchedulingMonitor#testRMStarts fails sporadically > - > > Key: YARN-7084 > URL: https://issues.apache.org/jira/browse/YARN-7084 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-7084.001.patch > > > TestSchedulingMonitor has been failing sporadically in precommit builds. > Failures look like this: > {noformat} > Running > org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor > Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.802 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor > testRMStarts(org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor) > Time elapsed: 1.728 sec <<< FAILURE! 
> org.mockito.exceptions.verification.WantedButNotInvoked: > Wanted but not invoked: > schedulingEditPolicy.editSchedule(); > -> at > org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58) > However, there were other interactions with this mock: > -> at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.&lt;init&gt;(SchedulingMonitor.java:50) > -> at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:61) > -> at > org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58) > {noformat}
[jira] [Updated] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
[ https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7249: - Fix Version/s: 2.8.2 OK. I cherry-picked this to 2.8.2. Thanks. > Fix CapacityScheduler NPE issue when a container preempted while the node is > being removed > -- > > Key: YARN-7249 > URL: https://issues.apache.org/jira/browse/YARN-7249 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.1 >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Blocker > Fix For: 2.8.2, 2.8.3 > > Attachments: YARN-7249.branch-2.8.001.patch > > > This issue could happen when 3 conditions are satisfied: > 1) A node is being removed from the scheduler. > 2) A container running on the node is being preempted. > 3) A rare race condition causes the scheduler to pass a null node to the leaf queue. > The fix is to add a null-node check inside CapacityScheduler. > Stack trace: > {code} > 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager > (ResourceManager.java:run(714)) - Error in handling event type > KILL_RESERVED_CONTAINER to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341) > > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127) > > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705) > > {code} > This issue exists only in 2.8.x
[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
[ https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180806#comment-16180806 ] Eric Payne commented on YARN-7249: -- {quote} I think it should be fine: containers are properly released when CapacityScheduler#removeNode is called. And if parallel threads access the scheduler and queue#completedContainer gets invoked with a non-null but already removed node, it becomes a no-op. Please let me know if you think different. {quote} Makes sense [~leftnoteasy]. Thanks. +1. Will commit later today.
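The no-op behavior discussed above can be sketched as a simple guard. This is an illustration only — the names and structure are assumptions, not the actual CapacityScheduler code from the patch.

```python
# Hypothetical sketch of the race described in this issue: if the node was
# already removed (the scheduler sees None), skip the queue bookkeeping rather
# than dereferencing the node, since removeNode already released its containers.
def completed_container_internal(node, container, released_log):
    if node is None:
        # Pre-fix code fell through and dereferenced the null node -> NPE.
        return False  # no-op: removeNode already handled this node's containers
    released_log.append(container)
    return True

log = []
completed_container_internal("node-1", "container-42", log)   # normal path
completed_container_internal(None, "container-43", log)       # removed node: no-op
```

The hedged point is the same as in the quote: because removal already releases the node's containers, turning the null-node case into a no-op does not leak state.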
[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
[ https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179653#comment-16179653 ] Eric Payne commented on YARN-7249: -- [~leftnoteasy], I recognize that calling {{queue.completedContainer}} in {{CapacityScheduler#completedContainerInternal}} doesn't make sense if {{node}} is null, but if {{queue.completedContainer}} isn't called, won't that leave references to the container inside internal structures? And, for example, won't the reserved-container counters be left un-decremented?
[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed
[ https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179553#comment-16179553 ] Eric Payne commented on YARN-7249: -- [~leftnoteasy]: Sure. Looking now.
[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls
[ https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179253#comment-16179253 ] Eric Payne commented on YARN-4163: -- [~lichangleo], please let me know if you plan on up-merging the patch and addressing the above comments. If you need help, please let me know.
[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
[ https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7245: - Attachment: CapSched UI Showing Inaccurate Per User Max AM Resource.png The value in the {{Max AM Resource}} column in the {{Active Users Info}} section of the Capacity Scheduler UI contains the value for {{Max Application Master Resources}}, which is the max for the whole queue. It should instead show the {{Max Application Master Resources Per User}} value, which is the max AM resource that a single user can use. See the attached screenshot.
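The distinction above — a queue-wide AM limit versus a per-user AM limit — can be illustrated with a small sketch. The formula and names here are assumptions for illustration, not the actual LeafQueue computation.

```python
# Hypothetical per-user AM limit: scale the queue-level AM limit by the
# user's allowed share (minimum-user-limit-percent * user-limit-factor,
# capped at the whole queue). The "Max AM Resource" column should show
# this per-user value, not the queue-wide limit.
def per_user_am_limit_mb(queue_am_limit_mb, user_limit_percent, user_limit_factor):
    user_share = min(1.0, (user_limit_percent / 100.0) * user_limit_factor)
    return int(queue_am_limit_mb * user_share)
```

For example, with a 10240 MB queue-level AM limit, MULP of 10%, and ULF of 2.0, a single user's AM limit under this sketch would be 2048 MB — a per-user number rather than the 10240 MB queue-wide figure the column was showing.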
[jira] [Created] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user
Eric Payne created YARN-7245: Summary: In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user Key: YARN-7245 URL: https://issues.apache.org/jira/browse/YARN-7245 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 3.0.0-alpha4, 2.8.1, 2.9.0 Reporter: Eric Payne
[jira] [Updated] (YARN-7116) CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM usage.
[ https://issues.apache.org/jira/browse/YARN-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7116: - Fix Version/s: 2.8.3 > CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM > usage. > - > > Key: YARN-7116 > URL: https://issues.apache.org/jira/browse/YARN-7116 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, webapp >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4 >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 2.9.0, 3.0.0-beta1, 2.8.3 > > Attachments: YARN-7116.001.patch > > > On the CapacityScheduler web UI, the AM usage of different users belonging to the same > queue always shows the queue's AM usage. > The root cause is in CapacitySchedulerPage: > {code} > tbody.tr().td(userInfo.getUsername()) > .td(userInfo.getUserResourceLimit().toString()) > .td(resourcesUsed.toString()) > .td(resourceUsages.getAMLimit().toString()) > .td(amUsed.toString()) > .td(Integer.toString(userInfo.getNumActiveApplications())) > .td(Integer.toString(userInfo.getNumPendingApplications()))._(); > {code} > Instead of amUsed.toString(), it should use userInfo.getAmUsed().
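The one-line fix described above — render the per-user value instead of the shared queue-level value — can be sketched outside of Hadoop. The field names here are illustrative, not the real UserInfo API.

```python
# Hypothetical sketch of the rendering bug: every user's row was built from
# the queue-wide AM usage, so all rows showed the same number. The fix is to
# read the per-user field when rendering each row.
def render_am_used_column(queue_am_used_mb, users, fixed=True):
    rows = []
    for user in users:
        # fixed=False reproduces the bug: one shared value for every row
        am_used = user["am_used_mb"] if fixed else queue_am_used_mb
        rows.append((user["name"], am_used))
    return rows

users = [{"name": "alice", "am_used_mb": 1024}, {"name": "bob", "am_used_mb": 512}]
```

With the buggy path, both rows would display the 1536 MB queue total; with the fix, alice shows 1024 MB and bob shows 512 MB.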
[jira] [Commented] (YARN-7116) CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM usage.
[ https://issues.apache.org/jira/browse/YARN-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176700#comment-16176700 ] Eric Payne commented on YARN-7116: -- [~leftnoteasy], [~sunilg] If there are no objections, I'll backport this to 2.8.
[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7149: - Fix Version/s: 3.1.0 2.9.0 > Cross-queue preemption sometimes starves an underserved queue > - > > Key: YARN-7149 > URL: https://issues.apache.org/jira/browse/YARN-7149 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 2.9.0, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.9.0, 3.0.0-beta1, 3.1.0 > > Attachments: YARN-7149.001.patch, YARN-7149.002.patch, > YARN-7149.demo.unit-test.patch > > > In branch 2 and trunk, I am consistently seeing some use cases where > cross-queue preemption does not happen when it should. I do not see this in > branch-2.8. > Use Case: > | | *Size* | *Minimum Container Size* | > |MyCluster | 20 GB | 0.5 GB | > | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit > Percent (MULP)* | *User Limit Factor (ULF)* | > |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 | > |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 | > - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB) > - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB > - _Note: containers are 0.5 GB._ > - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}. > - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}. > - _No more containers are ever preempted, even though {{Q2}} is far > underserved_
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170157#comment-16170157 ] Eric Payne commented on YARN-7149: -- Thanks a lot [~leftnoteasy]. Also, this needs to be pulled back into branch-2. I will do that if there are no objections.
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168499#comment-16168499 ] Eric Payne commented on YARN-7149: -- The following unit tests are succeeding for me in my environment: {code} TestOpportunisticContainerAllocatorAMService TestZKRMStateStore TestSubmitApplicationWithRMHA {code} {{TestContainerAllocation}} was modified by this patch, and the new test is succeeding. The failure in {{TestContainerAllocation#testAMContainerAllocationWhenDNSUnavailable}} is a pre-existing issue: YARN-7044.
[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7149: - Attachment: YARN-7149.002.patch bq. Do you think does it make sense to merge the {{YARN-7149.demo.unit-test.patch}} to your patch? Thanks [~leftnoteasy]. I spent some time looking through the test patch to make sure I understand its purpose. I think it makes sense to merge it with this change. Attaching an updated patch.
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166286#comment-16166286 ] Eric Payne commented on YARN-7149: -- Unit test failures are not related to this patch: {{TestAbstractYarnScheduler}}: Succeeds for me locally {{TestContainerAllocation}}: YARN-7044
[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7149: - Attachment: YARN-7149.001.patch Rather than use this JIRA to revert the {{computeUserLimit}} behavior to pre-YARN-5889, patch {{YARN-7149.001.patch}} just adds {{minimumAllocation (min container size)}} to {{resourceUsed}}. I see this as a compromise between the old and the new behavior. Please let me know your thoughts.
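As a rough sketch of what patch 001 changes, the arithmetic looks like the following. This is a simplified, memory-only model using the numbers from this JIRA's use case, not the actual {{UsersManager}} code; all names here are illustrative.

```java
public class UserLimitCompromiseSketch {
    public static void main(String[] args) {
        // Numbers from the use case in this JIRA, memory in GB.
        double minimumAllocation = 0.5; // minimum container size
        double resourceUsed = 1.0;      // held by the lone active user in Q2
        int activeUsers = 1;

        // Patch 001 pads the usage term with one minimum allocation, so the
        // computed user limit always sits at least one container above usage.
        double userLimit = (resourceUsed + minimumAllocation) / activeUsers;

        // The preemption monitor now sees a non-zero gap (userLimit - used)
        // while the queue is underserved, so preemption can continue.
        System.out.println(userLimit - resourceUsed);
    }
}
```

The point of the padding is that the gap between the limit and current usage never collapses to zero while a user is still asking.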
[jira] [Commented] (YARN-4727) Unable to override the $HADOOP_CONF_DIR env variable for container
[ https://issues.apache.org/jira/browse/YARN-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165194#comment-16165194 ] Eric Payne commented on YARN-4727: -- +1 Thanks [~jlowe] > Unable to override the $HADOOP_CONF_DIR env variable for container > -- > > Key: YARN-4727 > URL: https://issues.apache.org/jira/browse/YARN-4727 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.1, 2.5.2, 2.7.2, 2.6.4, 2.8.1 >Reporter: Terence Yim >Assignee: Jason Lowe > Attachments: YARN-4727.001.patch, YARN-4727.002.patch > > > Given the default config of "yarn.nodemanager.env-whitelist", an application > should be able to set the env variable $HADOOP_CONF_DIR to a value other than > the one in the NodeManager system environment. However, I believe that due to a > bug in the > {{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch}} > class, it is not possible to do so. > From the {{sanitizeEnv()}} method in the ContainerLaunch class > (https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java#L977) > {noformat} > putEnvIfNotNull(environment, > Environment.HADOOP_CONF_DIR.name(), > System.getenv(Environment.HADOOP_CONF_DIR.name()) > ); > if (!Shell.WINDOWS) { > environment.put("JVM_PID", "$$"); > } > String[] whitelist = conf.get(YarnConfiguration.NM_ENV_WHITELIST, > YarnConfiguration.DEFAULT_NM_ENV_WHITELIST).split(","); > > for (String whitelistEnvVariable : whitelist) { > putEnvIfAbsent(environment, whitelistEnvVariable.trim()); > } > ... > private static void putEnvIfAbsent( > Map<String, String> environment, String variable) { > if (environment.get(variable) == null) { > putEnvIfNotNull(environment, variable, System.getenv(variable)); > } > } > {noformat} > So there are two issues here. > 1.
the environment is already populated with the NM's system environment by the > preceding {{putEnvIfNotNull}} call, hence the {{putEnvIfAbsent}} call will never > set it to a new value > 2. Inside the {{putEnvIfAbsent}} call, it uses the system environment of the > NM, whereas it should be using the one from the {{launchContext}} instead.
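A minimal sketch of the second issue and its fix, assuming a hypothetical launch-context environment map. All class and method names here are illustrative stand-ins, not the actual {{ContainerLaunch}} code.

```java
import java.util.HashMap;
import java.util.Map;

public class EnvWhitelistSketch {
    // Hypothetical stand-in for the NM's own system environment.
    static Map<String, String> systemEnv = new HashMap<>();

    // Buggy shape: always falls back to the NM's own environment, so a
    // value supplied by the application's launch context can never win.
    static void putEnvIfAbsentBuggy(Map<String, String> environment, String var) {
        if (environment.get(var) == null) {
            environment.put(var, systemEnv.get(var));
        }
    }

    // Sketch of the fix: consult the application's launch-context
    // environment first, and only then the NM system environment.
    static void putEnvIfAbsentFixed(Map<String, String> environment,
            Map<String, String> launchContextEnv, String var) {
        if (environment.get(var) == null) {
            String value = launchContextEnv.get(var);
            if (value == null) {
                value = systemEnv.get(var);
            }
            if (value != null) {
                environment.put(var, value);
            }
        }
    }

    public static void main(String[] args) {
        systemEnv.put("HADOOP_CONF_DIR", "/etc/hadoop/conf");
        Map<String, String> launchCtx = new HashMap<>();
        launchCtx.put("HADOOP_CONF_DIR", "/app/conf");

        Map<String, String> env = new HashMap<>();
        putEnvIfAbsentFixed(env, launchCtx, "HADOOP_CONF_DIR");
        // The application's override survives instead of the NM's value.
        System.out.println(env.get("HADOOP_CONF_DIR"));
    }
}
```

With the fixed lookup order, a whitelisted variable set in the launch context overrides the NodeManager's inherited value, which is the behavior the whitelist default implies.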
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161546#comment-16161546 ] Eric Payne commented on YARN-7149: -- bq. You could check the unit test code to see if that matches your expectation. I see that the patch for YARN-5889 needed to change the headroom assertions in {{TestLeafQueue}}, in {{testComputeUserLimitAndSetHeadroom}} and {{testHeadroomWithMaxCap}}: {code} @@ -1123,9 +1129,9 @@ public void testComputeUserLimitAndSetHeadroom() throws IOException { //testcase3 still active - 2+2+6=10 assertEquals(10*GB, qb.getUsedResources().getMemorySize()); //app4 is user 0 -//maxqueue 16G, userlimit 13G, used 8G, headroom 5G +//maxqueue 16G, userlimit 7G, used 8G, headroom 5G //(8G used is 6G from this test case - app4, 2 from last test case, app_1) -assertEquals(5*GB, app_4.getHeadroom().getMemorySize()); +assertEquals(0*GB, app_4.getHeadroom().getMemorySize()); } @Test @@ -1309,8 +1315,8 @@ public void testHeadroomWithMaxCap() throws Exception { assertEquals(2*GB, app_0.getCurrentConsumption().getMemorySize()); assertEquals(0*GB, app_1.getCurrentConsumption().getMemorySize()); // TODO, fix headroom in the future patch -assertEquals(1*GB, app_0.getHeadroom().getMemorySize()); - // User limit = 4G, 2 in use +assertEquals(0*GB, app_0.getHeadroom().getMemorySize()); + // User limit = 2G, 2 in use assertEquals(0*GB, app_1.getHeadroom().getMemorySize()); // the application is not yet active @@ -1322,15 +1328,15 @@ assertEquals(3*GB, a.getUsedResources().getMemorySize()); assertEquals(2*GB, app_0.getCurrentConsumption().getMemorySize()); assertEquals(1*GB, app_1.getCurrentConsumption().getMemorySize()); -assertEquals(1*GB, app_0.getHeadroom().getMemorySize()); // 4G - 3G -assertEquals(1*GB, app_1.getHeadroom().getMemorySize()); // 4G - 3G +assertEquals(0*GB, app_0.getHeadroom().getMemorySize()); // 4G - 3G +assertEquals(0*GB, app_1.getHeadroom().getMemorySize()); // 4G - 3G // Submit requests for app_1 and set max-cap a.setMaxCapacity(.1f); app_2.updateResourceRequests(Collections.singletonList( {code}
[jira] [Updated] (YARN-6248) user is not removed from UsersManager’s when app is killed with pending container requests.
[ https://issues.apache.org/jira/browse/YARN-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-6248: - Fix Version/s: 2.9.0 > user is not removed from UsersManager’s when app is killed with pending > container requests. > --- > > Key: YARN-6248 > URL: https://issues.apache.org/jira/browse/YARN-6248 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha4 >Reporter: Eric Payne >Assignee: Eric Payne > Fix For: 2.9.0, 3.0.0-alpha4 > > Attachments: User Left Over.jpg, YARN-6248.001.patch > > > If an app is still asking for resources when it is killed, the user is left > in the UsersManager structure and shows up on the GUI.
[jira] [Commented] (YARN-6248) user is not removed from UsersManager’s when app is killed with pending container requests.
[ https://issues.apache.org/jira/browse/YARN-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157522#comment-16157522 ] Eric Payne commented on YARN-6248: -- I'm seeing this in branch-2 (2.9.0) as well. I will backport.
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157471#comment-16157471 ] Eric Payne commented on YARN-7149: -- bq. Yes you're correct, The max op is consistent with old behavior, we don't need to change it to min. [~leftnoteasy], Sorry, but I'm still confused about what behavior is desired. IMHO, the old behavior was more consistent with the expectations of the MULP in a capacity scheduler. That is, the first users with asking apps are elevated to their user limit as quickly as possible in a FIFO order. So, the thing I'm confused about is what the use case would be for raising all asking users more evenly in a capacity scheduler context. It seems to me that the latter could sometimes prevent any user from achieving its user limit. Thanks!
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157263#comment-16157263 ] Eric Payne commented on YARN-7149: -- Thanks very much for your insights [~leftnoteasy]. bq. When we have two active users in the queue, and userLimit set to 100, first user will always get preferred until queue reaches maxCapacity. I assume {{userLimit}} means {{minimum-user-limit-percent}}, correct? If so, then shouldn't the above statement be "first user will always get preferred until queue reaches {{Capacity * user-limit-factor}}"? If my assumptions are correct, then I think this is exactly the behavior we want. If a queue has a MULP of 100%, then by definition only the user with the first active app gets resources. Can you please elaborate on this use case?
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155872#comment-16155872 ] Eric Payne commented on YARN-7149: -- bq. I would like to still pursue Jason Lowe's suggestion about reverting to the pre-YARN-5889 behavior. I think I should clarify this. We are specifically talking about reverting the piece of code in {{computeUserLimit}} that boils (way) down to, roughly: {code:title=OLD} userLimit = (queueAllUsedResources < queueGuaranteedResources) ? (queueGuaranteedResources / #activeUsers) : ((queueAllUsedResources + minContainerSize) / #activeUsers) {code} {code:title=NEW} userLimit = (queueResourcesUsedByActiveUsers / #activeUsers) {code}
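Plugging the numbers from this JIRA's use case into the two formulas shows the difference. This is a simplified, memory-only sketch with GB as doubles, not the actual {{Resources}} calculator; the variable names mirror the pseudocode above.

```java
public class UserLimitSketch {
    public static void main(String[] args) {
        double minContainerSize = 0.5;        // GB, from the use case
        double queueGuaranteed = 10.0;        // Q2 guaranteed capacity
        double queueAllUsed = 1.0;            // App2 got 1 GB after preemption
        double usedByActiveUsers = 1.0;       // same 1 GB, one active user
        int activeUsers = 1;

        // OLD (pre-YARN-5889): while the queue is under its guarantee, the
        // user limit is the guarantee split among active users, so it stays
        // strictly above current usage.
        double oldLimit = queueAllUsed < queueGuaranteed
                ? queueGuaranteed / activeUsers
                : (queueAllUsed + minContainerSize) / activeUsers;

        // NEW: the limit is derived only from what active users already hold,
        // so it can equal current usage exactly.
        double newLimit = usedByActiveUsers / activeUsers;

        // Pending-resources-considering-user-limit is driven by (limit - used):
        System.out.println(oldLimit - usedByActiveUsers); // large gap: keep preempting
        System.out.println(newLimit - usedByActiveUsers); // zero gap: preemption stops
    }
}
```

Under the NEW formula the gap is 0 even though Q2 still wants 9 more GB, which matches the observed starvation: the preemption monitor concludes there is no unsatisfied demand.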
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155865#comment-16155865 ] Eric Payne commented on YARN-7149: -- Thanks for your insights, [~sunilg]. bq. Here I think we have to use {{getResourceLimitForAllUsers}} instead of {{getResourceLimitForActiveUsers}} No, I don't think so. {{LeafQueue#getTotalPendingResourcesConsideringUserLimit}} is called before preemption to calculate the total amount of resources that would be assigned to a queue if resources were suddenly freed up (by preemption). When those resources are freed up, the scheduler will use {{getResourceLimitForActiveUsers}} when deciding how much of them to give to the queue. They should both use {{getResourceLimitForActiveUsers}}, which calculates the user limit for active users only. I would still like to pursue [~jlowe]'s suggestion about reverting to the pre-YARN-5889 behavior.
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153529#comment-16153529 ] Eric Payne commented on YARN-7149: -- [~sunilg], [~leftnoteasy], and [~jlowe], I would be interested in your thoughts on this JIRA. Thanks!
[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
[ https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151071#comment-16151071 ] Eric Payne commented on YARN-7149: -- I believe that this is related to the refactoring of the way the user limit is calculated in trunk / branch-2. I cannot reproduce the above use case prior to YARN-5889. I think this is because prior to YARN-5889 (and in 2.8), the result of {{LeafQueue#computeUserLimit}} was always greater than the amount used by any given user. After YARN-5889, the return value of {{UsersManager#computeUserLimit}} can be equal to the amount used by any given user. Then, in {{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, we can have a situation where {{userLimit}} - {{user.getUsed}} always equals 0, even when it shouldn't.
[jira] [Created] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue
Eric Payne created YARN-7149: Summary: Cross-queue preemption sometimes starves an underserved queue Key: YARN-7149 URL: https://issues.apache.org/jira/browse/YARN-7149 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.0.0-alpha3, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne In branch 2 and trunk, I am consistently seeing some use cases where cross-queue preemption does not happen when it should. I do not see this in branch-2.8. Use Case: | | *Size* | *Minimum Container Size* | |MyCluster | 20 GB | 0.5 GB | | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit Percent (MULP)* | *User Limit Factor (ULF)* | |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 | |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 | - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB) - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB - _Note: containers are 0.5 GB._ - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}. - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}. - _No more containers are ever preempted, even though {{Q2}} is far underserved_
[jira] [Updated] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
[ https://issues.apache.org/jira/browse/YARN-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7120: - Attachment: YARN-7120.001.patch > CapacitySchedulerPage NPE in "Aggregate scheduler counts" section > - > > Key: YARN-7120 > URL: https://issues.apache.org/jira/browse/YARN-7120 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Minor > Attachments: Aggregate Scheduler Counts All Sections.png, Aggregate > Scheduler Counts Bottom Cut Off.png, YARN-7120.001.patch > > > The problem manifests itself by having the bottom part of the "Aggregated > scheduler counts" section cut off on the GUI and an NPE in the RM log. > {noformat} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > at > org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49) > at > org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117) > at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848) > at > 
org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71) > at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82) > at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86) > ... 58 more > {noformat}
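One way to avoid an NPE of this shape is a null guard at render time, sketched below with hypothetical names. The real fix lives in {{CapacitySchedulerPage$HealthBlock}}; this only illustrates the guard idea for health info that may not exist yet (e.g. before the scheduler's first allocation).

```java
public class HealthBlockGuardSketch {
    // Hypothetical stand-in for the scheduler-health detail whose absence
    // triggered the NPE when the page dereferenced it during render.
    static class LastRunDetails {
        String containerId = "container_1_0001_01_000001";
    }

    // Render helper sketch: emit a placeholder instead of dereferencing null.
    static String renderLastAllocation(LastRunDetails details) {
        if (details == null) {
            return "N/A"; // guard: nothing has been scheduled yet
        }
        return details.containerId;
    }

    public static void main(String[] args) {
        System.out.println(renderLastAllocation(null));
        System.out.println(renderLastAllocation(new LastRunDetails()));
    }
}
```

With the guard in place the "Aggregate scheduler counts" section can render completely instead of being cut off when a detail row is missing.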
[jira] [Updated] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
[ https://issues.apache.org/jira/browse/YARN-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7120: - Attachment: Aggregate Scheduler Counts All Sections.png Aggregate Scheduler Counts Bottom Cut Off.png
[jira] [Created] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
Eric Payne created YARN-7120: Summary: CapacitySchedulerPage NPE in "Aggregate scheduler counts" section Key: YARN-7120 URL: https://issues.apache.org/jira/browse/YARN-7120 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0 Reporter: Eric Payne Assignee: Eric Payne Priority: Minor The problem manifests as the bottom part of the "Aggregate scheduler counts" section being cut off in the GUI, along with an NPE in the RM log.
{noformat}
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
	at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
	at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
	at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
	at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
	at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
	at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
	at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
	at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
	... 58 more
{noformat}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
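The trace above shows {{HealthBlock.render}} dereferencing something that presumably has not been initialized when the page is rendered. As a hedged illustration only (hypothetical class and method names, not the actual CapacitySchedulerPage code), a render helper can guard a possibly-missing health detail instead of letting the null propagate:

```java
import java.util.HashMap;
import java.util.Map;

public class HealthBlockSketch {
    // Hypothetical helper: format one scheduler-health detail, tolerating
    // entries that the scheduler has not recorded yet. Dereferencing a
    // missing entry directly is the kind of path that produces the NPE
    // seen in the RM log above.
    static String renderOperationDetail(Map<String, Long> details, String op) {
        Long ts = details.get(op);
        return op + "=" + (ts == null ? "N/A" : ts.toString());
    }

    public static void main(String[] args) {
        Map<String, Long> details = new HashMap<>();
        details.put("last-allocation", 1503326514161L);
        // Present entry renders normally; absent entry falls back safely.
        System.out.println(renderOperationDetail(details, "last-allocation"));
        System.out.println(renderOperationDetail(details, "last-preemption"));
    }
}
```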
[jira] [Commented] (YARN-7051) Avoid concurrent modification exception in FifoIntraQueuePreemptionPlugin
[ https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143830#comment-16143830 ] Eric Payne commented on YARN-7051: -- Thanks, [~sunilg], for your help in resolving this problem. > Avoid concurrent modification exception in FifoIntraQueuePreemptionPlugin > - > > Key: YARN-7051 > URL: https://issues.apache.org/jira/browse/YARN-7051 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Critical > Fix For: 2.9.0, 3.0.0-beta1, 2.8.2 > > Attachments: YARN-7051.001.patch, YARN-7051.002.patch > > > {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the following code:
> {code}
> Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
> Resource amUsed = Resources.createResource(0, 0);
> for (FiCaSchedulerApp app : runningApps) {
> {code}
> {{runningApps}} is unmodifiable but not concurrent. This caused the preemption monitor thread to crash in the RM in one of our clusters.
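The failure mode described above can be reproduced in miniature: an unmodifiable view still iterates the live backing collection, so a structural change mid-iteration fails fast with a {{ConcurrentModificationException}}. A small self-contained sketch (plain strings standing in for {{FiCaSchedulerApp}}; the snapshot fix is one possible approach, not necessarily what the patch does):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    public static void main(String[] args) {
        List<String> backing = new ArrayList<>(Arrays.asList("app1", "app2"));
        // Unmodifiable does NOT mean concurrent: the view's iterator walks
        // the live backing list and is fail-fast.
        Collection<String> view = Collections.unmodifiableCollection(backing);

        boolean threw = false;
        try {
            for (String app : view) {
                backing.add("app3"); // simulates the scheduler thread mutating the queue
            }
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println("CME thrown: " + threw);

        // One fix sketch: snapshot the apps before iterating (ideally while
        // holding the queue's lock), then iterate the private copy.
        List<String> snapshot = new ArrayList<>(view);
        for (String app : snapshot) {
            backing.add(app + "-seen"); // safe: we iterate the snapshot, not the live list
        }
        System.out.println("snapshot size: " + snapshot.size());
    }
}
```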
[jira] [Commented] (YARN-7087) NM failed to perform log aggregation due to absent container
[ https://issues.apache.org/jira/browse/YARN-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16142140#comment-16142140 ] Eric Payne commented on YARN-7087: -- [~jlowe], thanks for finding, reporting, and fixing this issue. +1. The patch LGTM. I will commit this afternoon. > NM failed to perform log aggregation due to absent container > > > Key: YARN-7087 > URL: https://issues.apache.org/jira/browse/YARN-7087 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.8.1 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-7087.001.patch, YARN-7087.002.patch > > > Saw a case where the NM failed to aggregate the logs for a container because it claimed it was absent:
> {noformat}
> 2017-08-23 18:35:38,283 [AsyncDispatcher event handler] WARN logaggregation.LogAggregationService: Log aggregation cannot be started for container_e07_1503326514161_502342_01_01, as its an absent container
> {noformat}
> Containers should not be allowed to disappear if they're not done being fully processed by the NM.
[jira] [Commented] (YARN-7052) RM SchedulingMonitor gives no indication why the spawned thread crashed.
[ https://issues.apache.org/jira/browse/YARN-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141746#comment-16141746 ] Eric Payne commented on YARN-7052: -- The following unit tests are all passing for me in my environment:
{noformat}
org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands
org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
org.apache.hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels
{noformat}
The {{TestContainerAllocation}} unit test failure is the same one tracked in YARN-7044. > RM SchedulingMonitor gives no indication why the spawned thread crashed. > > > Key: YARN-7052 > URL: https://issues.apache.org/jira/browse/YARN-7052 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Critical > Attachments: YARN-7052.001.patch > > > In YARN-7051, we ran into a case where the preemption monitor thread hung with no indication of why. > The preemption monitor is started by the {{SchedulingExecutorService}} from {{SchedulingMonitor#serviceStart}}. Once an uncaught throwable happens, nothing ever gets the result of the future, the thread running the preemption monitor never dies, and it never gets rescheduled. > If {{HadoopExecutor}} were used, it would at least provide a {{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.
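The behavior described for {{SchedulingMonitor}} is standard {{ScheduledThreadPoolExecutor}} semantics: if a periodic task throws, subsequent executions are suppressed and the throwable is only visible to whoever calls {{Future#get}}, so without a wrapper that logs it, the crash is silent. A minimal stand-alone demonstration (not YARN code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class MonitorCrashDemo {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);

        // A periodic task that throws: the executor cancels further runs and
        // stashes the throwable in the Future. Nothing is printed or logged.
        ScheduledFuture<?> future = exec.scheduleWithFixedDelay(
            () -> { throw new IllegalStateException("preemption monitor died"); },
            0, 10, TimeUnit.MILLISECONDS);

        try {
            // The cause only surfaces if something actively asks the Future,
            // which is exactly what SchedulingMonitor (per the description
            // above) never does.
            future.get(1, TimeUnit.SECONDS);
        } catch (ExecutionException e) {
            System.out.println("cause: " + e.getCause().getMessage());
        }
        exec.shutdownNow();
    }
}
```

Wrapping the Runnable in a try/catch that logs before rethrowing is the usual way to make such crashes observable, which appears to be what the {{HadoopScheduledThreadPoolExecutor}} suggestion amounts to.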
[jira] [Commented] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception
[ https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141684#comment-16141684 ] Eric Payne commented on YARN-7051: -- Unit test failures are not related to this patch. - {{TestContainerAllocation}}: YARN-7044 - {{h.y.s.r.security.TestDelegationTokenRenewer}}: YARN-5816
{code:title=LeafQueue#getAllApplications}
public Collection<FiCaSchedulerApp> getAllApplications() {
  Collection<FiCaSchedulerApp> apps = new HashSet<FiCaSchedulerApp>(
      pendingOrderingPolicy.getSchedulableEntities());
  apps.addAll(orderingPolicy.getSchedulableEntities());
  return Collections.unmodifiableCollection(apps);
}
{code}
bq. {{getAllApplications}} in {{LeafQueue}} then has to be under readlock also, correct?
Possibly. It looks like {{HashSet#addAll}} will iterate through {{orderingPolicy}}, which could possibly change during the loop. However, I would like to have that discussion on a separate JIRA, since I may be misinterpreting how {{addAll}} works and since the usage of {{getAllApplications}} affects more than just preemption. > FifoIntraQueuePreemptionPlugin can get concurrent modification exception > > > Key: YARN-7051 > URL: https://issues.apache.org/jira/browse/YARN-7051 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Critical > Attachments: YARN-7051.001.patch, YARN-7051.002.patch > > > {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the following code:
> {code}
> Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
> Resource amUsed = Resources.createResource(0, 0);
> for (FiCaSchedulerApp app : runningApps) {
> {code}
> {{runningApps}} is unmodifiable but not concurrent. This caused the preemption monitor thread to crash in the RM in one of our clusters.
[jira] [Updated] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception
[ https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-7051: - Attachment: YARN-7051.002.patch bq. so this won't be changing while createTempAppForResCalculation is looping over the list. However, I did find a race condition that throws an NPE within {{createTempAppForResCalculation}}.
{noformat}
java.lang.NullPointerException
	at org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:155)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.createTempAppForResCalculation(FifoIntraQueuePreemptionPlugin.java:403)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.computeAppsIdealAllocation(FifoIntraQueuePreemptionPlugin.java:140)
	at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:283)
{noformat}
The reason for this is that {{perUserAMUsed}} was populated with running apps prior to calling {{createTempAppForResCalculation}}, but then {{createTempAppForResCalculation}} loops through both running and pending apps. Attaching a new patch that addresses this.
> FifoIntraQueuePreemptionPlugin can get concurrent modification exception > > > Key: YARN-7051 > URL: https://issues.apache.org/jira/browse/YARN-7051 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, scheduler preemption, yarn >Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3 >Reporter: Eric Payne >Assignee: Eric Payne >Priority: Critical > Attachments: YARN-7051.001.patch, YARN-7051.002.patch > > > {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the > following code: > {code} > Collection runningApps = leafQueue.getApplications(); > Resource amUsed = Resources.createResource(0, 0); > for (FiCaSchedulerApp app : runningApps) { > {code} > {{runningApps}} is unmodifiable but not concurrent. This caused the > preemption monitor thread to crash in the RM in one of our clusters.
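The NPE behind the 002 patch boils down to looking up a per-user map entry that was only populated for users with running apps, while the temp-app loop also visits users with only pending apps. A toy sketch of guarding that lookup (ints standing in for {{Resource}}; names hypothetical, not the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

public class PerUserAmUsedSketch {
    public static void main(String[] args) {
        // Stand-in for perUserAMUsed: only users with *running* apps
        // get an entry when it is populated.
        Map<String, Integer> perUserAMUsed = new HashMap<>();
        perUserAMUsed.put("alice", 2048);

        // The temp-app calculation iterates running AND pending apps, so
        // it can encounter a user ("bob") with no entry. Cloning a null
        // Resource is what produced the NPE in Resources.clone above;
        // falling back to a zero value avoids it.
        for (String user : new String[] {"alice", "bob"}) {
            int amUsed = perUserAMUsed.getOrDefault(user, 0);
            System.out.println(user + "=" + amUsed);
        }
    }
}
```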