[jira] [Updated] (YARN-7619) Max AM Resource value in CS UI is different for every user

2017-12-12 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7619:
-
Attachment: YARN-7619.001.patch

Uploading patch 001. This is not a perfect solution, but it's close. The 
pre-weighted AM limit for all users in a particular queue is calculated in 
{{LeafQueue#getUserAMResourceLimitPerPartition}} and passed to the UI via the 
{{UserInfo}} object for each user when the UI is rendered. This is a little 
awkward because the AM limit for users in the queue is a per-queue value, but 
when rendering, I wanted to multiply that value by each user's weight. 

The value displayed in the Max AM Resource field on the UI may not always be 
valid for weighted users because it is not normalized, and it may exceed the 
queue-level AM limit if the weight is large. But since this is only for display 
purposes, I think it's acceptable.
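
As a rough, self-contained sketch of the render-time weighting described above 
(illustrative names only, not the code in the patch):

{code}
// Illustrative sketch only: multiply the pre-weighted, per-user AM limit for
// the queue by each user's weight at render time. The result is display-only,
// is not normalized to the container size, and can exceed the queue-level AM
// limit when the weight is large.
public class WeightedAmLimitSketch {
  static long weightedAmLimitMB(long queueUserAmLimitMB, float userWeight) {
    return (long) (queueUserAmLimitMB * userWeight);
  }

  public static void main(String[] args) {
    long queueUserAmLimitMB = 2048;                        // hypothetical value
    System.out.println(weightedAmLimitMB(queueUserAmLimitMB, 1.0f)); // 2048
    System.out.println(weightedAmLimitMB(queueUserAmLimitMB, 3.0f)); // 6144
  }
}
{code}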

> Max AM Resource value in CS UI is different for every user
> --
>
> Key: YARN-7619
> URL: https://issues.apache.org/jira/browse/YARN-7619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Max AM Resources is Different for Each User.png, 
> YARN-7619.001.patch
>
>
> YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
> scheduler UI used to contain the queue-level AM limit instead of the 
> user-level AM limit. It fixed this by using the user-specific AM limit that 
> is calculated in {{LeafQueue#activateApplications}}, stored in each user's 
> {{LeafQueue#User}} object, and retrieved via 
> {{UserInfo#getResourceUsageInfo}}.
> The problem is that this user-specific AM limit depends on the activity of 
> other users and other applications in a queue, and it is only calculated and 
> updated when a user's application is activated. So, when 
> {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
> value unless an application was recently activated for a particular user.






[jira] [Commented] (YARN-7619) Max AM Resource value in CS UI is different for every user

2017-12-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282707#comment-16282707
 ] 

Eric Payne commented on YARN-7619:
--

All of the other solutions I could think of seem undesirable.

One solution would be to have {{LeafQueue}} remember the last user for which it 
activated an application. The resource usages for that user are passed through 
the {{UserInfo}} object to {{CapacitySchedulerPage}}, which then extracts the 
last activated user's AM limit from those usages. This is not ideal, because it 
doesn't take user weights into account. So, if the last activated user has a 
weight not equal to 1.0, the AM limit may be wrong for some users. (_On a side 
note, user weights do not appear to affect user AM limits even though 
{{LeafQueue#getUserAMResourceLimitPerPartition}} seems to compute the limit 
using user weights_). Also, if the last activated user leaves the queue, we 
have to fall back to each user's AM limit, which puts us back where we started.

Another solution may be to have {{UsersManager}} sort the users list into 
last-activated-first order. Then, when 
{{CapacitySchedulerPage#QueueUsersInfoBlock}} is rendering the users info 
block, it could just get the user AM limit from the first user. That's what 
{{CapacitySchedulerPage#LeafQueueInfoBlock}} does when it retrieves the value 
for *Max Application Master Resources Per User*: it just expects the first one 
to be the correct one for all the users in the queue.

Ideally, it would be best to save the recomputed user AM limit to every user's 
object whenever {{LeafQueue#getUserAMResourceLimitPerPartition}} is called, but 
that may cause a significant performance hit. Even so, I think this option is 
the cleanest, and the performance hit may not be that bad.
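
As a rough sketch of what that cleanest option would look like (toy types only, 
not the real {{LeafQueue}}/{{UsersManager}} code):

{code}
import java.util.HashMap;
import java.util.Map;

// Toy model: whenever the per-partition user AM limit is recomputed, push the
// fresh (weighted) value onto every user's object so the UI never reads a
// stale, activation-time copy.
class RefreshAllUsersAmLimitSketch {
  static class User {
    float weight = 1.0f;
    long cachedAmLimitMB;                  // what the UI would read
  }

  final Map<String, User> users = new HashMap<>();

  void onUserAmLimitRecomputed(long queueUserAmLimitMB) {
    for (User u : users.values()) {
      u.cachedAmLimitMB = (long) (queueUserAmLimitMB * u.weight);
    }
  }
}
{code}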

> Max AM Resource value in CS UI is different for every user
> --
>
> Key: YARN-7619
> URL: https://issues.apache.org/jira/browse/YARN-7619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Max AM Resources is Different for Each User.png
>
>
> YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
> scheduler UI used to contain the queue-level AM limit instead of the 
> user-level AM limit. It fixed this by using the user-specific AM limit that 
> is calculated in {{LeafQueue#activateApplications}}, stored in each user's 
> {{LeafQueue#User}} object, and retrieved via 
> {{UserInfo#getResourceUsageInfo}}.
> The problem is that this user-specific AM limit depends on the activity of 
> other users and other applications in a queue, and it is only calculated and 
> updated when a user's application is activated. So, when 
> {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
> value unless an application was recently activated for a particular user.






[jira] [Updated] (YARN-7619) Max AM Resource value in CS UI is different for every user

2017-12-06 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7619:
-
Attachment: Max AM Resources is Different for Each User.png

> Max AM Resource value in CS UI is different for every user
> --
>
> Key: YARN-7619
> URL: https://issues.apache.org/jira/browse/YARN-7619
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: Max AM Resources is Different for Each User.png
>
>
> YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
> scheduler UI used to contain the queue-level AM limit instead of the 
> user-level AM limit. It fixed this by using the user-specific AM limit that 
> is calculated in {{LeafQueue#activateApplications}}, stored in each user's 
> {{LeafQueue#User}} object, and retrieved via 
> {{UserInfo#getResourceUsageInfo}}.
> The problem is that this user-specific AM limit depends on the activity of 
> other users and other applications in a queue, and it is only calculated and 
> updated when a user's application is activated. So, when 
> {{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
> value unless an application was recently activated for a particular user.






[jira] [Created] (YARN-7619) Max AM Resource value in CS UI is different for every user

2017-12-06 Thread Eric Payne (JIRA)
Eric Payne created YARN-7619:


 Summary: Max AM Resource value in CS UI is different for every user
 Key: YARN-7619
 URL: https://issues.apache.org/jira/browse/YARN-7619
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0
Reporter: Eric Payne
Assignee: Eric Payne


YARN-7245 addressed the problem that the {{Max AM Resource}} in the capacity 
scheduler UI used to contain the queue-level AM limit instead of the user-level 
AM limit. It fixed this by using the user-specific AM limit that is calculated 
in {{LeafQueue#activateApplications}}, stored in each user's {{LeafQueue#User}} 
object, and retrieved via {{UserInfo#getResourceUsageInfo}}.

The problem is that this user-specific AM limit depends on the activity of 
other users and other applications in a queue, and it is only calculated and 
updated when a user's application is activated. So, when 
{{CapacitySchedulerPage}} retrieves the user-specific AM limit, it is a stale 
value unless an application was recently activated for a particular user.
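
To make the staleness concrete, here is a toy model (illustrative classes only, 
not the actual scheduler objects): the cached per-user limit is written only on 
the activation path, so a later read from the web UI can return a value 
computed under old queue and user conditions.

{code}
// Illustrative sketch only -- not the real LeafQueue/UserInfo classes.
class StaleAmLimitSketch {
  static class User {
    long cachedUserAmLimitMB;              // written during activation only
  }

  // Analogous to the activation path: the only place the cached limit
  // is refreshed.
  static void activateApplication(User user, long freshUserAmLimitMB) {
    user.cachedUserAmLimitMB = freshUserAmLimitMB;
  }

  // Analogous to the UI read path: returns whatever was cached at the
  // last activation, however stale that has become.
  static long renderMaxAmResource(User user) {
    return user.cachedUserAmLimitMB;
  }
}
{code}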






[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-12-03 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16276200#comment-16276200
 ] 

Eric Payne commented on YARN-6124:
--

Sorry for the delay. I am fine with the patch.

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Zian Chen
> Fix For: 3.1.0
>
> Attachments: YARN-6124.4.patch, YARN-6124.5.patch, YARN-6124.6.patch, 
> YARN-6124.wip.1.patch, YARN-6124.wip.2.patch, YARN-6124.wip.3.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.






[jira] [Commented] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue

2017-11-28 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269455#comment-16269455
 ] 

Eric Payne commented on YARN-7575:
--

Sorry, my bad. My ULF is set to 2.0 on the default queue. After setting it to 
3.0, my use case works.
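
For reference, the setting in question is the standard per-queue user limit 
factor (the value shown is just the one from my test setup):

{noformat}
yarn.scheduler.capacity.root.default.user-limit-factor = 3.0
{noformat}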

On the plus side, we know that ULF works as expected with absolute capacity :)

+1 on the patch. Thanks, [~sunilg]

> When using absolute capacity configuration with no max capacity, scheduler UI 
> NPEs and can't grow queue
> ---
>
> Key: YARN-7575
> URL: https://issues.apache.org/jira/browse/YARN-7575
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Eric Payne
> Attachments: YARN-7575-YARN-5881.001.patch
>
>
> I encountered the following while reviewing and testing branch YARN-5881.
> The design document from YARN-5881 says that for max-capacity:
> {quote}
> 3)  For each queue, we require:
> a) if max-resource not set, it automatically set to parent.max-resource
> {quote}
> When I try leaving blank 
> {{yarn.scheduler.capacity.<queue-path>.maximum-capacity}}, the RMUI scheduler 
> page refuses to render. It looks like it's in 
> {{CapacitySchedulerPage$LeafQueueInfoBlock}}:
> {noformat}
> 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error 
> handling URI: /cluster/scheduler
> java.lang.reflect.InvocationTargetException
> ...
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129)
> {noformat}
> Also... A job will run in the leaf queue with no max capacity set and it will 
> grow to the max capacity of the cluster, but if I add resources to the node, 
> the job won't grow any more even though it has pending resources.






[jira] [Commented] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue

2017-11-28 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269175#comment-16269175
 ] 

Eric Payne commented on YARN-7575:
--

[~sunilg], the fix for the UI NPE looks good, but the other problem I'm having 
is that when I increase a node size, the queue doesn't grow.

My configs are as follows:
- 4 node managers, each with 5120MB and 10 vcores, for a cluster total of [memory=20480, vcores=40]
- {{yarn.scheduler.capacity.root.default.capacity}}: [memory=10240,vcores=20]
- {{yarn.scheduler.capacity.root.eng.capacity}}: [memory=10240,vcores=20]
- Note that I do not set root.capacity, nor do I set any maximum-capacity.

My use case is as follows:
- I start a job requesting 22.5GB and 45 vcores (container size=0.5GB)
- the job consumes 20GB and 40 vcores
- I add 2.5GB and 5 vcores to one of the nodes:
{{yarn rmadmin -updateNodeResource host:port 7680 15}}
- One more container is assigned to the job, but that only brings the job to 
20.5GB and 41 vcores.


> When using absolute capacity configuration with no max capacity, scheduler UI 
> NPEs and can't grow queue
> ---
>
> Key: YARN-7575
> URL: https://issues.apache.org/jira/browse/YARN-7575
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Eric Payne
> Attachments: YARN-7575-YARN-5881.001.patch
>
>
> I encountered the following while reviewing and testing branch YARN-5881.
> The design document from YARN-5881 says that for max-capacity:
> {quote}
> 3)  For each queue, we require:
> a) if max-resource not set, it automatically set to parent.max-resource
> {quote}
> When I try leaving blank 
> {{yarn.scheduler.capacity.<queue-path>.maximum-capacity}}, the RMUI scheduler 
> page refuses to render. It looks like it's in 
> {{CapacitySchedulerPage$LeafQueueInfoBlock}}:
> {noformat}
> 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error 
> handling URI: /cluster/scheduler
> java.lang.reflect.InvocationTargetException
> ...
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129)
> {noformat}
> Also... A job will run in the leaf queue with no max capacity set and it will 
> grow to the max capacity of the cluster, but if I add resources to the node, 
> the job won't grow any more even though it has pending resources.






[jira] [Updated] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue

2017-11-28 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7575:
-
Issue Type: Sub-task  (was: Bug)
Parent: YARN-5881

> When using absolute capacity configuration with no max capacity, scheduler UI 
> NPEs and can't grow queue
> ---
>
> Key: YARN-7575
> URL: https://issues.apache.org/jira/browse/YARN-7575
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Eric Payne
>
> I encountered the following while reviewing and testing branch YARN-5881.
> The design document from YARN-5881 says that for max-capacity:
> {quote}
> 3)  For each queue, we require:
> a) if max-resource not set, it automatically set to parent.max-resource
> {quote}
> When I try leaving blank 
> {{yarn.scheduler.capacity.<queue-path>.maximum-capacity}}, the RMUI scheduler 
> page refuses to render. It looks like it's in 
> {{CapacitySchedulerPage$LeafQueueInfoBlock}}:
> {noformat}
> 2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error 
> handling URI: /cluster/scheduler
> java.lang.reflect.InvocationTargetException
> ...
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129)
> {noformat}
> Also... A job will run in the leaf queue with no max capacity set and it will 
> grow to the max capacity of the cluster, but if I add resources to the node, 
> the job won't grow any more even though it has pending resources.






[jira] [Created] (YARN-7575) When using absolute capacity configuration with no max capacity, scheduler UI NPEs and can't grow queue

2017-11-28 Thread Eric Payne (JIRA)
Eric Payne created YARN-7575:


 Summary: When using absolute capacity configuration with no max 
capacity, scheduler UI NPEs and can't grow queue
 Key: YARN-7575
 URL: https://issues.apache.org/jira/browse/YARN-7575
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Eric Payne


I encountered the following while reviewing and testing branch YARN-5881.

The design document from YARN-5881 says that for max-capacity:
{quote}
3)  For each queue, we require:
a) if max-resource not set, it automatically set to parent.max-resource
{quote}

When I try leaving blank 
{{yarn.scheduler.capacity.<queue-path>.maximum-capacity}}, the RMUI scheduler 
page refuses to render. It looks like it's in 
{{CapacitySchedulerPage$LeafQueueInfoBlock}}:
{noformat}
2017-11-28 11:29:16,974 [qtp43473566-220] ERROR webapp.Dispatcher: error 
handling URI: /cluster/scheduler
java.lang.reflect.InvocationTargetException
...
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderQueueCapacityInfo(CapacitySchedulerPage.java:164)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$LeafQueueInfoBlock.renderLeafQueueInfoWithoutParition(CapacitySchedulerPage.java:129)
{noformat}

Also... A job will run in the leaf queue with no max capacity set and it will 
grow to the max capacity of the cluster, but if I add resources to the node, 
the job won't grow any more even though it has pending resources.






[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265837#comment-16265837
 ] 

Eric Payne commented on YARN-7496:
--

Thank you very much, [~leftnoteasy]

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.8.3
>
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-21 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261617#comment-16261617
 ] 

Eric Payne commented on YARN-7496:
--

{code}
 hadoop.yarn.server.resourcemanager.TestClientRMTokens
 hadoop.yarn.server.resourcemanager.TestAMAuthorization
 hadoop.yarn.server.resourcemanager.TestLeaderElectorService 
{code}
These tests are passing for me in my local environment.

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Commented] (YARN-7533) Documentation for absolute resource support in CS

2017-11-21 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261565#comment-16261565
 ] 

Eric Payne commented on YARN-7533:
--

[~sunilg] this looks good to me.

> Documentation for absolute resource support in CS
> -
>
> Key: YARN-7533
> URL: https://issues.apache.org/jira/browse/YARN-7533
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: YARN-7533-YARN-5881.002.patch, YARN-7533.001.patch
>
>







[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-21 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261242#comment-16261242
 ] 

Eric Payne commented on YARN-7496:
--

Thanks for your review, [~leftnoteasy]. What's the next step?

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Commented] (YARN-7533) Documentation for absolute resource support in CS

2017-11-20 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260064#comment-16260064
 ] 

Eric Payne commented on YARN-7533:
--

Hi [~sunilg]. Thanks for the patch. I have just a couple of grammatical 
suggestions:
 
- {{Resource Allocation}}, I would suggest to change it to the following:
{code}
+  * Resource Allocation using Absolute Resources configuration
+ `CapacityScheduler` supports configuration of absolute resources instead of 
providing Queue *capacity* in percentage. The following configurations could be 
used to configure absolute resources.
{code}

- {{yarn.scheduler.capacity.<queue-path>.min-resource}}, something like:
{code}
+ | `yarn.scheduler.capacity.<queue-path>.min-resource` | Absolute resource 
queue capacity minimum configuration. Default value is empty. 
[memory=10240,vcores=12] is a valid configuration which indicates 10GB Memory 
and 12 VCores.|
+ | `yarn.scheduler.capacity.<queue-path>.max-resource` | Absolute resource 
queue capacity maximum configuration. Default value is empty. 
[memory=10240,vcores=12] is a valid configuration which indicates 10GB Memory 
and 12 VCores.|
{code}


> Documentation for absolute resource support in CS
> -
>
> Key: YARN-7533
> URL: https://issues.apache.org/jira/browse/YARN-7533
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil G
>Assignee: Sunil G
> Attachments: YARN-7533.001.patch
>
>







[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-11-20 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259865#comment-16259865
 ] 

Eric Payne commented on YARN-6124:
--

bq. AdminService#refreshQueues, conf.size(): I'm not sure why this is needed
I see that if that call is not there, the following exception occurs:
{noformat}
refreshQueues: com.ctc.wstx.exc.WstxIOException: Stream closed
{noformat}
Still, calling {{conf.size()}} seems awkward. It seems like there should be a 
better way to do this.

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Zian Chen
> Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch, 
> YARN-6124.wip.3.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.






[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-11-20 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259540#comment-16259540
 ] 

Eric Payne commented on YARN-6124:
--

Thanks, [~Zian Chen]. I appreciate the good work here. Sorry for the late reply. 
I have a couple of comments.

- {{AdminService#refreshQueues}}, first comment:
{code}
// We use getConfig() before which gets a capacity-scheduler.xml reference
// when parsing it into CapacityScheduler#reinitialize, but we need to get
// properties from yarn-site.xml when we want to enable/disable preemption
{code}
-- I wouldn't say anything about what it did before or about the capacity 
scheduler, since this calls into all of the schedulers. Also, I wouldn't 
specify preemption properties since the scheduling monitor can be pluggable and 
doesn't have to be for preemption. I would just say something like this: {{// 
Retrieve yarn-site.xml in order to refresh scheduling monitor properties.}}
- {{AdminService#refreshQueues}}, {{conf.size()}}:
-- Comment says {{force the Configuration#getProps been called to reload all 
the properties.}}. I'm not sure why this is needed. I'm pretty sure that when 
{{SchedulingMonitorManager#updateSchedulingMonitors}} calls the following code, 
it will also call {{Configuration#getProps}} at that point:
{code}
boolean monitorsEnabled = conf.getBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS);
{code}

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Zian Chen
> Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch, 
> YARN-6124.wip.3.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.






[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-20 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259412#comment-16259412
 ] 

Eric Payne commented on YARN-7496:
--

Thanks for looking at this [~leftnoteasy]

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Updated] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-16 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7496:
-
Attachment: YARN-7496.001.branch-2.8.patch

Attaching a fix for branch-2.8. This change is in 
{{LeafQueue#computeUserLimit}}. It should only affect preemption user limit 
calculations and should not affect assignment user limit calculations.

Since it does touch the computations for user limit, I would really appreciate 
it if [~sunilg], [~leftnoteasy], or [~jlowe] could take a look at it.

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7496.001.branch-2.8.patch
>
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-16 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16255891#comment-16255891
 ] 

Eric Payne commented on YARN-7469:
--

Thank you very much, [~sunilg]. My one concern is that this fix should also go 
into branch-2.9, since it is also in branch-2.8.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.8.3, 3.0.0, 3.1.0, 2.10.0
>
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-15 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253828#comment-16253828
 ] 

Eric Payne commented on YARN-7469:
--

Thanks [~sunilg]! That would be great!

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-15 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253721#comment-16253721
 ] 

Eric Payne commented on YARN-7469:
--

bq.  now min container is the dead zone here
I filed YARN-7501 to include a "dead zone" around the user limit.

bq. in 2.8, this fix has a problem of oscillation due to the difference in how 
user limit is calculated between 2.8 and later releases. 
[~sunilg], I think this patch should be used to fix the user starvation problem 
and the 2.8-specific oscillation problem can be handled by YARN-7496. 
{{YARN-7469.001.patch}} will apply cleanly to all branches back to branch-2.8.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Created] (YARN-7501) Capacity Scheduler Intra-queue preemption should have a "dead zone" around user limit

2017-11-15 Thread Eric Payne (JIRA)
Eric Payne created YARN-7501:


 Summary: Capacity Scheduler Intra-queue preemption should have a 
"dead zone" around user limit
 Key: YARN-7501
 URL: https://issues.apache.org/jira/browse/YARN-7501
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2, 3.1.0
Reporter: Eric Payne









[jira] [Commented] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-14 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16252643#comment-16252643
 ] 

Eric Payne commented on YARN-7496:
--

Cluster Configuration:
- Cluster Memory: 20GB
- Queue1 capacity and max capacity: 50% : 100%
- Queue2 capacity and max capacity: 50% : 100%
- Queue1: Intra-queue preemption: enabled
- Default container size: 0.5GB

Use Case:
- User1 submits App1 in Queue1 and consumes 12.5GB
- User2 submits App2 in Queue1 and consumes 7.5GB
- User3 submits App3 in Queue1
- Preemption monitor calculates user limit to be {{((total used resources in 
Queue1) / (number of all users)) + (1 container) = normalizeup((20GB/3),0.5GB) 
+ 0.5GB = 7GB + 0.5GB = 7.5GB}}
- Preemption monitor sees that App1 is the only one that has resources, so it 
tries to preempt containers from {{App1}} down to 7.5GB.
- The problem comes here: Capacity Scheduler calculates user limit to be 
{{((total used resources in Queue1) / (number of active users)) + (1 container) 
= normalizeup((20GB/2),0.5GB) + 0.5GB = 10GB + 0.5GB = 10.5GB}}
- Therefore, once {{App1}} gets to 10.5GB, the preemption monitor will try to 
preempt 2.5GB more resources from {{App1}}, but the Capacity Scheduler gives 
them back. This creates oscillation.
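
The arithmetic above can be condensed into a small standalone snippet 
(hard-coded values from this example; this is illustrative only, not the actual 
{{LeafQueue}} or preemption-monitor code):

{code}
// Toy reproduction of the two user-limit computations from the example above.
public class UserLimitOscillationSketch {
  static double normalizeUp(double value, double step) {
    return Math.ceil(value / step) * step;
  }

  public static void main(String[] args) {
    double usedGB = 20.0, containerGB = 0.5;
    int allUsers = 3, activeUsers = 2;

    // Preemption monitor: divides by all users in the queue.
    double preemptionLimit =
        normalizeUp(usedGB / allUsers, containerGB) + containerGB;    // 7.5GB

    // Capacity Scheduler in 2.8: divides by active users only.
    double schedulerLimit =
        normalizeUp(usedGB / activeUsers, containerGB) + containerGB; // 10.5GB

    System.out.println(preemptionLimit + " vs " + schedulerLimit);
  }
}
{code}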

> CS Intra-queue preemption user-limit calculations are not in line with 
> LeafQueue user-limit calculations
> 
>
> Key: YARN-7496
> URL: https://issues.apache.org/jira/browse/YARN-7496
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> Only a problem in 2.8.
> Preemption could oscillate due to the difference in how user limit is 
> calculated between 2.8 and later releases.
> Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
> limit on the Capacity Scheduler side in 2.8 is {{total used resources / 
> number of active users}} while the calculation in later releases is {{total 
> active resources / number of active users}}. When intra-queue preemption was 
> backported to 2.8, its calculations for user limit were more aligned with 
> the latter algorithm, which is in 2.9 and later releases.






[jira] [Created] (YARN-7496) CS Intra-queue preemption user-limit calculations are not in line with LeafQueue user-limit calculations

2017-11-14 Thread Eric Payne (JIRA)
Eric Payne created YARN-7496:


 Summary: CS Intra-queue preemption user-limit calculations are not 
in line with LeafQueue user-limit calculations
 Key: YARN-7496
 URL: https://issues.apache.org/jira/browse/YARN-7496
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne


Only a problem in 2.8.

Preemption could oscillate due to the difference in how user limit is 
calculated between 2.8 and later releases.

Basically (ignoring ULF, MULP, and maybe others), the calculation for user 
limit on the Capacity Scheduler side in 2.8 is {{total used resources / number 
of active users}} while the calculation in later releases is {{total active 
resources / number of active users}}. When intra-queue preemption was 
backported to 2.8, its calculations for user limit were more aligned with the 
latter algorithm, which is in 2.9 and later releases.






[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-13 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250237#comment-16250237
 ] 

Eric Payne commented on YARN-7469:
--

bq. In broader perspective, i think we are lacking dead zone here. In a way, 
now min container is the dead zone here. But if user gets more control on this, 
may be more oscillations could be avoided. May be we can take up that also in 
another ticket.
[~sunilg], Thanks for looking at the patch. Yes, I agree that a dead zone above 
the user limit would be a very helpful feature to add.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-13 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7469:
-
Attachment: YARN-7469.001.patch

Attaching a proposal for a patch to fix this problem.

Proposed fix: In {{calculateToBePreemptedResourcePerApp}}, if the 
{{USERLIMIT_FIRST}} policy is set, subtract off the minimum container size. 
Basically, the code in {{skipContainerBasedOnIntraQueuePolicy}} skips the 
container if preempting it would bring the app down to the user limit, because 
the capacity scheduler assigns one container more than the user limit.

Also, in 2.8, this fix has a problem of oscillation due to the difference in 
how user limit is calculated between 2.8 and later releases. Basically 
(ignoring ULF, MULP, and maybe others), the calculation in 2.8 is {{total used 
resources / number of active users}} while the calculation in later releases is 
{{total active resources / number of active users}}. With this fix in 2.8, the 
value of {{getResourceLimitForAllUsers}} (used by the preemption monitor) could 
become greater than the {{getHeadroom}} value used by the LeafQueue, which 
would cause more preemption to occur than necessary.

Bottom line is that I'm still working on a 2.8 solution.
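
A condensed sketch of the adjustment in patch 001 (toy variables, using the 
simplified per-app formula from my earlier analysis on this JIRA, not the 
actual {{FifoIntraQueuePreemptionPlugin}} code):

{code}
// Toy sketch: under the USERLIMIT_FIRST policy, back the preemptable amount
// off by one minimum container so that the later per-container skip check and
// the preemption target no longer disagree by exactly one container.
public class UserLimitFirstAdjustmentSketch {
  static long preemptableFromAppMB(long usedMB, long userLimitMB, long amSizeMB,
      long minContainerMB, boolean userLimitFirst) {
    long preemptable = usedMB - (userLimitMB - amSizeMB);
    if (userLimitFirst) {
      preemptable -= minContainerMB;   // the proposed adjustment
    }
    return Math.max(preemptable, 0);
  }
}
{code}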

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch, YARN-7469.001.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Commented] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-10 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247790#comment-16247790
 ] 

Eric Payne commented on YARN-7469:
--

When a queue is in the state described above, 
{{FifoIntraQueuePreemptionPlugin#calculateToBePreemptedResourcePerApp}} decides 
(erroneously, I believe) that {{app2}} has preemptable resources. Since 
{{app2}} is the youngest app that appears to have preemptable resources, 
{{FifoIntraQueuePreemptionPlugin#preemptFromLeastStarvedApp}} selects a 
container to preempt from {{app2}}. However, when it calls 
{{FifoIntraQueuePreemptionPlugin#skipContainerBasedOnIntraQueuePolicy}}, it 
decides that preempting the selected container would bring the app down to the 
user limit, so it skips the container. But it doesn't go on to the next 
youngest app with resources.

The logic breaks down to basically this:
{code}
calculateToBePreemptedResourcePerApp {
  // preemtableFromApp will be used to select containers to preempt.
  preemtableFromApp = used - (userlimit - AmSize)
}

skipContainerBasedOnIntraQueuePolicy {
  if ((used - selectedContainerSize) <= (userlimit + AmSize)) {
    Skip this container
  }
}
{code}
We get into this starvation mode when {{selectedContainerSize}} ends up being 
the same size as {{preemtableFromApp}}: in that case, {{used - 
selectedContainerSize}} equals {{userlimit - AmSize}}, which is always at or 
below {{userlimit + AmSize}}, so the selected container is always skipped.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-09 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7469:
-
Attachment: UnitTestToShowStarvedUser.patch

Uploading a unit test that demonstrates this.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: UnitTestToShowStarvedUser.patch
>
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Updated] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-09 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7469:
-
Description: 
Queue Configuration:
- Total Memory: 20GB
- 2 Queues
-- Queue1
--- Memory: 10GB
--- MULP: 10%
--- ULF: 2.0
- Minimum Container Size: 0.5GB

Use Case:
- User1 submits app1 to Queue1 and consumes 20GB
- User2 submits app2 to Queue1 and requests 7.5GB
- Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
resources to User2
- User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 container 
for the AM.
- Preemption monitor never preempts a container.

> Capacity Scheduler Intra-queue preemption: User can starve if newest app is 
> exactly at user limit
> -
>
> Key: YARN-7469
> URL: https://issues.apache.org/jira/browse/YARN-7469
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> Queue Configuration:
> - Total Memory: 20GB
> - 2 Queues
> -- Queue1
> --- Memory: 10GB
> --- MULP: 10%
> --- ULF: 2.0
> - Minimum Container Size: 0.5GB
> Use Case:
> - User1 submits app1 to Queue1 and consumes 20GB
> - User2 submits app2 to Queue1 and requests 7.5GB
> - Preemption monitor preempts 7.5GB from app1. Capacity Scheduler gives those 
> resources to User2
> - User 3 submits app3 to Queue1. To begin with, app3 is requesting 1 
> container for the AM.
> - Preemption monitor never preempts a container.






[jira] [Created] (YARN-7469) Capacity Scheduler Intra-queue preemption: User can starve if newest app is exactly at user limit

2017-11-09 Thread Eric Payne (JIRA)
Eric Payne created YARN-7469:


 Summary: Capacity Scheduler Intra-queue preemption: User can 
starve if newest app is exactly at user limit
 Key: YARN-7469
 URL: https://issues.apache.org/jira/browse/YARN-7469
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-beta1, 2.9.0, 2.8.2
Reporter: Eric Payne
Assignee: Eric Payne









[jira] [Commented] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-11-03 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237768#comment-16237768
 ] 

Eric Payne commented on YARN-7424:
--

Thanks for the review, [~sunilg].
bq. 
"yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.minimum-threshold"
 could be configured to start intra queue preemption on a queue. But yes, this 
is generic in all queues.

IIUC {{minimum-threshold}} will prevent intra-queue preemption from acting 
within a queue until the queue's used resources are above {{minimum-threshold 
* capacity}}, which is not really helpful here.

bq. max-allowable-limit helps only to control preemption in a given round of 
preemption calculation. This could be configured to a very low value so that 
only few resource will be preempted in such cases.
Yes, and we could try to reduce this value even more, which could potentially 
be helpful. I am just surprised that reducing this from 20% to 3% did not have 
nearly as much effect as I expected.

bq. Now the solution which you mentioned will help to control preemption.
Actually, after thinking about it more, the proposed solution is not very 
useful. Here's why:
- Queue1 is configured with 1% MULP
- User1 submits app1 to queue1 and consumes 100% of the resources
- User2 submits app2 to queue1 and requests resources
- The preemption monitor preempts resources from app1 and the capacity 
scheduler gives them to app2 until app2 is at 1%
- User 3 submits app3 to queue1 and requests resources.
- The preemption monitor preempts resources from app1, but the capacity 
scheduler doesn't give them to app3. It gives them to app2 because the user 
limit resource value is 33%, and app2 came before app3, and user2 is below 33%.
- So, with the proposed solution, user3 keeps asking for resources and the 
preemption monitor keeps taking them from app1 and the capacity scheduler keeps 
giving them to app2 until user2 is above 33%.

If you multiply this out to 60 users all asking for resources in a queue with 
1% MULP, it is doing pretty much the exact same amount of preempting and 
balancing as before. In order to create the "desired" behavior, we would have 
to fundamentally change the way the capacity scheduler works, which we don't 
want to do.
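
To put rough numbers on the bullets above (illustrative only, assuming the 
user limit works out to roughly the queue capacity divided by the number of 
active users, as described in the scenario):
{noformat}
queue1 MULP                 = 1%
active users                = 3 (user1, user2, user3)
scheduler user limit        ~ 100% / 3 active users ~ 33% of the queue
user2 after proposed cap    = 1%  (preemption on user2's behalf stops at MULP)
resources preempted for user3 -> still assigned to app2 until user2 reaches ~33%
{noformat}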


> Capacity Scheduler Intra-queue preemption: add property to only preempt up to 
> configured MULP
> -
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
>
> If the queue's configured minimum user limit percent (MULP) is something 
> small like 1%, all users will max out well over their MULP until 100 users 
> have apps in the queue. Since the intra-queue preemption monitor tries to 
> balance the resource among the users, most of the time in this use case it 
> will be preempting containers on behalf of users that are already over their 
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be 
> configured to only preempt on behalf of a user until that user has reached 
> its MULP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236151#comment-16236151
 ] 

Eric Payne commented on YARN-7370:
--

Findbugs warnings are for 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.DominantResourceFairnessPolicy}}
 and are not related to this patch.

{{CapacitySchedulerConfiguration.java}} has an unused import for 
{{ImmutableMap}}. I'll go ahead and remove it as part of the commit.

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch, YARN-7370.004.patch, YARN-7370.005.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236089#comment-16236089
 ] 

Eric Payne commented on YARN-7370:
--

Waiting for precommit to complete: 
https://builds.apache.org/job/PreCommit-YARN-Build/18307/

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch, YARN-7370.004.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235862#comment-16235862
 ] 

Eric Payne commented on YARN-7370:
--

[~sunilg], thanks for your comment. I see what you mean. If we take out the 
{{if (this.csConfig != null)}}, then the values will be logged during 
initialization as well as during refresh. That way we can compare newly logged 
values with the initial ones.

[~GergelyNovak], Sorry for changing it at this point, but [~sunilg] has a good 
point. Would you mind making this change?

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235817#comment-16235817
 ] 

Eric Payne commented on YARN-7370:
--

[~GergelyNovak], Thanks for your effort on this feature. Patch LGTM
+1

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235747#comment-16235747
 ] 

Eric Payne commented on YARN-7370:
--

[~subru] and [~asuresh],

I am aware that you have requested that things should not be checked into 
branch-2.9. However, we need this feature to go into branch-2.8, and I think it 
will be awkward if this feature is in 2.8, branch-2, branch-3 and trunk, but 
not branch-2.9. Would it be appropriate to commit this to branch-2.9 as well?

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-11-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235740#comment-16235740
 ] 

Eric Payne commented on YARN-7370:
--

Thanks for the patch, [~GergelyNovak]. I will review today.
bq. The reason I changed the default constants is the precision problem of the 
automatic conversion: 0.1 becomes 0.10000000149011612 and it looked funny in 
the newly introduced log message.
Fair enough.
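
For reference, a minimal stand-alone illustration of the widening artifact 
being described (plain Java, not code from the patch):
{code}
public class FloatWideningDemo {
  public static void main(String[] args) {
    float floatDefault = 0.1f;        // a float-typed default constant
    double widened = floatDefault;    // implicit float -> double widening
    System.out.println(widened);      // prints 0.10000000149011612
    System.out.println(0.1d);         // a double literal prints 0.1
  }
}
{code}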

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>Priority: Major
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch, 
> YARN-7370.003.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-10-31 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-7424:


Assignee: Eric Payne

> Capacity Scheduler Intra-queue preemption: add property to only preempt up to 
> configured MULP
> -
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> If the queue's configured minimum user limit percent (MULP) is something 
> small like 1%, all users will max out well over their MULP until 100 users 
> have apps in the queue. Since the intra-queue preemption monitor tries to 
> balance the resource among the users, most of the time in this use case it 
> will be preempting containers on behalf of users that are already over their 
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be 
> configured to only preempt on behalf of a user until that user has reached 
> its MULP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-10-31 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227528#comment-16227528
 ] 

Eric Payne commented on YARN-7424:
--

In a large, multi-tenant queue with a MULP of 1%, after instrumenting intra-queue 
preemption, we have discovered that enabling both inter-queue and intra-queue 
preemption causes an order of magnitude more lost work than enabling 
inter-queue preemption alone. Even after reducing 
{{intra-queue-preemption.max-allowable-limit}} from 20% (default) to 3%, the 
lost work is still several times more than with inter-queue preemption alone.

| | *MemSeconds Lost* |
| *Only inter-queue preemption enabled* | {{LostCrossQueueMemSec}} |
| *Both inter- and intra-queue preemption enabled with 20% max-allowable-limit* | {{12.7824 * LostCrossQueueMemSec}} |
| *Both inter- and intra-queue preemption enabled with 3% max-allowable-limit* | {{7.9893 * LostCrossQueueMemSec}} |

| | *Vcoreseconds Lost* |
| *Only inter-queue preemption enabled* | {{LostCrossQueueVSec}} |
| *Both inter- and intra-queue preemption enabled with 20% max-allowable-limit* | {{26.1885 * LostCrossQueueVSec}} |
| *Both inter- and intra-queue preemption enabled with 3% max-allowable-limit* | {{19.2676 * LostCrossQueueVSec}} |

It is expected that turning on intra-queue preemption would increase the number 
of preemptions. However, an order of magnitude more seems excessive. Also, 
reducing {{intra-queue-preemption.max-allowable-limit}} didn't have nearly the 
effect I thought it would.

I think there is an underlying design philosophy that should be addressed.

The current intra-queue preemption design balances the user limit among all of 
the users. This calculation is based on the total queue capacity and the number 
of users in the queue. In a very large queue with a large number of active 
users, the number of users in the queue is constantly changing. Also, if the 
node overcommit feature is enabled, the total size of the queue will change as 
well when the cluster becomes very busy. The result is that preemption must 
constantly happen in order to balance all of the users.

For this reason, we need a configuration property that stops preempting on 
behalf of a user once the user is above the MULP, which is a stable value. As a 
variation, we may want to have a "live zone" of MULP plus some configurable 
value.
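
A minimal sketch of the kind of guard such a property could introduce (method 
and parameter names here are hypothetical, not the actual preemption-policy 
code):
{code}
// Hypothetical check in the intra-queue candidate selector: only preempt on
// behalf of a user while that user is still below its MULP share of the
// queue's guaranteed capacity, plus an optional configurable "live zone".
boolean shouldPreemptOnBehalfOf(Resource userUsed, Resource queueGuaranteed,
    float minimumUserLimitPercent, float liveZonePercent,
    ResourceCalculator rc, Resource clusterResource) {
  Resource mulpShare = Resources.multiply(queueGuaranteed,
      (minimumUserLimitPercent + liveZonePercent) / 100.0f);
  return Resources.lessThan(rc, clusterResource, userUsed, mulpShare);
}
{code}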


> Capacity Scheduler Intra-queue preemption: add property to only preempt up to 
> configured MULP
> -
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>
> If the queue's configured minimum user limit percent (MULP) is something 
> small like 1%, all users will max out well over their MULP until 100 users 
> have apps in the queue. Since the intra-queue preemption monitor tries to 
> balance the resource among the users, most of the time in this use case it 
> will be preempting containers on behalf of users that are already over their 
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be 
> configured to only preempt on behalf of a user until that user has reached 
> its MULP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-10-31 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7424:
-
Description: 
If the queue's configured minimum user limit percent (MULP) is something small 
like 1%, all users will max out well over their MULP until 100 users have apps 
in the queue. Since the intra-queue preemption monitor tries to balance the 
resource among the users, most of the time in this use case it will be 
preempting containers on behalf of users that are already over their MULP 
guarantee.

This JIRA proposes that a property should be provided so that a queue can be 
configured to only preempt on behalf of a user until that user has reached its 
MULP.


  was:




> Capacity Scheduler Intra-queue preemption: add property to only preempt up to 
> configured MULP
> -
>
> Key: YARN-7424
> URL: https://issues.apache.org/jira/browse/YARN-7424
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 3.0.0-beta1, 2.8.2
>Reporter: Eric Payne
>
> If the queue's configured minimum user limit percent (MULP) is something 
> small like 1%, all users will max out well over their MULP until 100 users 
> have apps in the queue. Since the intra-queue preemption monitor tries to 
> balance the resource among the users, most of the time in this use case it 
> will be preempting containers on behalf of users that are already over their 
> MULP guarantee.
> This JIRA proposes that a property should be provided so that a queue can be 
> configured to only preempt on behalf of a user until that user has reached 
> its MULP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7424) Capacity Scheduler Intra-queue preemption: add property to only preempt up to configured MULP

2017-10-31 Thread Eric Payne (JIRA)
Eric Payne created YARN-7424:


 Summary: Capacity Scheduler Intra-queue preemption: add property 
to only preempt up to configured MULP
 Key: YARN-7424
 URL: https://issues.apache.org/jira/browse/YARN-7424
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-beta1, 2.8.2
Reporter: Eric Payne







--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-10-31 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227276#comment-16227276
 ] 

Eric Payne commented on YARN-7370:
--

[~GergelyNovak], Thanks for the updated patch. Just a couple of things:
- Why were {{DEFAULT_PREEMPTION_MAX_IGNORED_OVER_CAPACITY}} and 
{{DEFAULT_PREEMPTION_NATURAL_TERMINATION_FACTOR}} changed from float to double? 
The capacity scheduler configuration properties are not consistent about the 
usage of float and double, but it looks like the preemption properties are 
using float. If we want to make it consistent or change these to double, I 
would prefer to do it as a separate JIRA.
- Thanks for adding the log documenting the updated properties. Can you please 
add the following properties to the log statement?
-- isIntraQueuePreemptionEnabled
-- selectCandidatesForResevedContainers
-- isQueuePriorityPreemptionEnabled
-- additionalPreemptionBasedOnReservedResource



> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
> Attachments: YARN-7370.001.patch, YARN-7370.002.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-10-27 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222830#comment-16222830
 ] 

Eric Payne commented on YARN-7370:
--

Thanks [~GergelyNovak] for the work on this patch. I just have a couple of 
small issues with the patch and one suggestion.

- {{ProportionalCapacityPreemptionPolicy}} has an unused import of 
{{YarnConfiguration}}
- In {{ProportionalCapacityPreemptionPolicy#updateConfigIfNeeded}}, can we 
switch the names of the local {{csConfig}} variable and the global class 
instance variable {{config}}? My opinion is that a class instance variable 
should have the more descriptive name.
- It would be nice if {{updateConfigIfNeeded}} would LOG the values of all of 
the properties so that we have a record in the RM syslog whenever the values 
are refreshed (a rough sketch follows below).
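
For illustration of the last point, a sketch of what that log statement could 
look like (the field names are placeholders for the policy's refreshed 
settings, not necessarily the actual member names):
{code}
// Hypothetical: emit one INFO line whenever the preemption properties are
// (re)read so the RM syslog records every refresh.
private void logRefreshedProperties() {
  LOG.info("Preemption properties refreshed:"
      + " observeOnly=" + observeOnly
      + ", monitoringInterval=" + monitoringInterval
      + ", maxWaitTime=" + maxWaitTime
      + ", maxIgnoredOverCapacity=" + maxIgnoredOverCapacity
      + ", naturalTerminationFactor=" + naturalTerminationFactor
      + ", percentageClusterPreemptionAllowed="
      + percentageClusterPreemptionAllowed);
}
{code}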

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
> Attachments: YARN-7370.001.patch
>
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-10-26 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16221064#comment-16221064
 ] 

Eric Payne commented on YARN-6124:
--

Thanks [~leftnoteasy]. I will document my findings and you can work on it when 
you get to it. YARN-7370 doesn't depend on this JIRA, does it?

I got it to move past the NPE, but the changes I made may not be the best (it 
may have other side effects):
{code}
   public void serviceInit(Configuration conf) throws Exception {
     Configuration configuration = new Configuration(conf);
-    super.serviceInit(conf);
     initScheduler(configuration);
+    super.serviceInit(conf);
   }
{code}
Also, a quick test didn't seem to work. I started the RM with 
{{yarn.resourcemanager.scheduler.monitor.enable}} set to {{true}}, changed it 
to false, and then did {{-refreshQueues}}. It's going through the 
{{updateSchedulingMonitors}} code but it doesn't change the value.

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-10-26 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220844#comment-16220844
 ] 

Eric Payne commented on YARN-6124:
--

Thanks [~leftnoteasy]. The proof of concept looks good, but in this version the 
{{ProportionalCapacityPreemptionPolicy}} is NPE-ing during {{init}} because 
{{scheduler.getConfiguration()}} is returning null.

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6124.wip.1.patch, YARN-6124.wip.2.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Preemption properties should be refreshable

2017-10-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218617#comment-16218617
 ] 

Eric Payne commented on YARN-7370:
--

Thanks [~leftnoteasy] for the further design specifications.

bq. YARN-6142, we will take care of all scheduling edit policy refresh.
YARN-6142 is closed, so I'm not sure where the actual work will take place.

As for the rest, it sounds like a good plan.

> Preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Gergely Novák
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6124) Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin -refreshQueues

2017-10-24 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16216867#comment-16216867
 ] 

Eric Payne commented on YARN-6124:
--

Yes, I agree that these should be part of the scheduler. That makes a lot of 
sense.

> Make SchedulingEditPolicy can be enabled / disabled / updated with RMAdmin 
> -refreshQueues
> -
>
> Key: YARN-6124
> URL: https://issues.apache.org/jira/browse/YARN-6124
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6124.wip.1.patch
>
>
> Now enabled / disable / update SchedulingEditPolicy config requires restart 
> RM. This is inconvenient when admin wants to make changes to 
> SchedulingEditPolicies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Intra-queue preemption properties should be refreshable

2017-10-23 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215525#comment-16215525
 ] 

Eric Payne commented on YARN-7370:
--

Thanks [~leftnoteasy], [~sunilg], and [~GergelyNovak].

So, just to be clear, I think we would all like the following preemption 
properties to be refreshable with {{yarn rmadmin -refreshQueues}}:
{noformat}

yarn.resourcemanager.scheduler.monitor.enable
yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity
yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill
yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval
yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor
yarn.resourcemanager.monitor.capacity.preemption.observe_only
yarn.resourcemanager.monitor.capacity.preemption.select_based_on_reserved_containers
yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round
yarn.scheduler.capacity.lazy-preemption-enabled


# Intra-queue-specific properties:
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.minimum-threshold
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.max-allowable-limit
yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.preemption-order-policy
{noformat}

I do NOT think we want to refresh 
{{yarn.resourcemanager.scheduler.monitor.policies}} since that would require 
stopping and restarting the monitor thread. At least, if we want to make this 
refreshable, I suggest that we do it as part of a separate JIRA.

Also, just FYI, the 
{{yarn.scheduler.capacity.root.\[QUEUEPATH\].disable_preemption}} property is 
already refreshable.


> Intra-queue preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-23 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215426#comment-16215426
 ] 

Eric Payne commented on YARN-4163:
--

Thanks [~jlowe] and [~lichangleo]. I will commit this to trunk, branch-3.0, 
branch-2, and branch-2.8.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, 
> YARN-4163.007.branch-2.8.patch, YARN-4163.007.patch, YARN-4163.2.patch, 
> YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-23 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.007.branch-2.8.patch

Attach branch-2.8 specific patch

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, 
> YARN-4163.007.branch-2.8.patch, YARN-4163.007.patch, YARN-4163.2.patch, 
> YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7370) Intra-queue preemption properties should be refreshable

2017-10-20 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212646#comment-16212646
 ] 

Eric Payne commented on YARN-7370:
--

[~GergelyNovak], thank you for your interest. Please go ahead and take this 
JIRA.
bq. 2) Do you mean to add a new rmadmin command like -refreshSchedulingMonitors 
or make this part of -refreshQueues?
My opinion is to include these as part of the {{-refreshQueues}} option. The 
queue-specific disable preemption option is refreshable under 
{{-refreshQueues}}, so I think it makes sense to refresh the others in the same 
way.
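
For reference, the admin flow would then be the same one already used for 
queue changes (assuming the preemption properties are re-read on refresh once 
this JIRA lands):
{noformat}
# 1. Edit the preemption properties in the scheduler configuration.
# 2. Apply the change to the running ResourceManager, no restart needed:
yarn rmadmin -refreshQueues
{noformat}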

> Intra-queue preemption properties should be refreshable
> ---
>
> Key: YARN-7370
> URL: https://issues.apache.org/jira/browse/YARN-7370
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0, 3.0.0-alpha3
>Reporter: Eric Payne
>
> At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
> should be refreshable. It would also be nice to make 
> {{intra-queue-preemption.enabled}} and {{preemption-order-policy}} 
> refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7370) Intra-queue preemption properties should be refreshable

2017-10-19 Thread Eric Payne (JIRA)
Eric Payne created YARN-7370:


 Summary: Intra-queue preemption properties should be refreshable
 Key: YARN-7370
 URL: https://issues.apache.org/jira/browse/YARN-7370
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, scheduler preemption
Affects Versions: 3.0.0-alpha3, 2.8.0
Reporter: Eric Payne


At least the properties for {{max-allowable-limit}} and {{minimum-threshold}} 
should be refreshable. It would also be nice to make 
{{intra-queue-preemption.enabled}} and {{preemption-order-policy}} refreshable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-18 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209494#comment-16209494
 ] 

Eric Payne commented on YARN-4163:
--

The unit tests are passing for me in my local environment.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.007.patch, 
> YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-17 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.007.patch

Thanks [~jlowe] for the review.

I have made some "generic" APIs for successful and failure logs that take the 
common set of arguments plus an {{ArgsBuilder}} that contains the 
operation-specific arguments.

These generic APIs could be used to replace the existing success and failure 
log methods. I suggest that a separate JIRA be created for that.

I have a separate branch-2.8 patch that I will upload once the pre-commit build 
completes for this patch.
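
A rough illustration of the builder-style call described above (class and 
method names are assumptions for illustration, not the committed API):
{code}
// Hypothetical usage: the common audit fields stay as positional arguments,
// while operation-specific key/value pairs are collected in a builder.
ArgsBuilder args = new ArgsBuilder()
    .append("queueName", request.getQueueName())
    .append("includeApplications",
        String.valueOf(request.getIncludeApplications()));
RMAuditLogger.logSuccess(callerUGI.getShortUserName(),
    "getQueueInfo", "ClientRMService", args);
{code}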

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.007.patch, 
> YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-10-05 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193591#comment-16193591
 ] 

Eric Payne commented on YARN-7245:
--

Thanks [~sunilg]. Looking forward to your response.

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png, Max AM Resource Per User -- Fixed.png, YARN-7245.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-05 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.006.branch-2.8.patch

Attaching branch-2.8 patch

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.branch-2.8.patch, YARN-4163.006.patch, YARN-4163.2.patch, 
> YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-05 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.006.patch

Attaching {{YARN-4163.006.patch}} to address checkstyle issues.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.006.patch, YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, 
> YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-05 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193102#comment-16193102
 ] 

Eric Payne commented on YARN-4163:
--

OK, so there are still some valid checkstyle warnings. Interestingly, when I 
ran testpatch locally, none of these showed up.

The only ones I won't be fixing are those complaining about too many args in 
the method signature. In order to fix this, I would have to refactor the 
methods.

Also, this patch applies cleanly to trunk, branch-3, and branch-2, but not 
branch-2.8. I will upload a separate branch-2.8 patch.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-10-05 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.005.patch

[~jlowe], attaching patch {{YARN-4163.005.patch}}. This contains an ArgsBuilder 
class, as suggested, and fixes for javadocs warnings. I fixed some of the 
checkstyle warnings, but others I did not fix due to other considerations. I 
will comment further once the pre-commit build comes back with the current 
warnings.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.005.patch, 
> YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-10-04 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7245:
-
Attachment: Max AM Resource Per User -- Fixed.png

I attached {{YARN-7245.001.patch}} to address this.

I also attached a screenshot to show that the value in the {{Max AM Resource}} 
column matches the value in the {{Max Application Master Resources Per User}} 
field.

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png, Max AM Resource Per User -- Fixed.png, YARN-7245.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-10-04 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7245:
-
Attachment: YARN-7245.001.patch

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png, YARN-7245.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7271) Add a yarn application cost calculation framework in TimelineService v2

2017-10-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188450#comment-16188450
 ] 

Eric Payne commented on YARN-7271:
--

[~vrushalic], The RM has a built-in calculation that keeps track of memory and 
vcore usage. I'm linking YARN-415 to see if it meets your needs.

> Add a yarn application cost calculation framework in TimelineService v2
> ---
>
> Key: YARN-7271
> URL: https://issues.apache.org/jira/browse/YARN-7271
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineclient, timelinereader, timelineserver
>Reporter: Vrushali C
>
> Timeline Service v2 captures information about a yarn application. From this 
> info, we would like to calculate the "cost" of a yarn application. This 
> would be rolled up to the flow level as well (and user and queue level 
> eventually).
> We need a way to accept machine cost (TCO per day) and enable this 
> calculation. This will help in chargeback for yarn apps. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-10-02 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188018#comment-16188018
 ] 

Eric Payne commented on YARN-7245:
--

bq. This is bad. We ideally need user based max-am-limit.
[~sunilg], the value for {{Max Application Master Resources Per User}} exists 
and is used by the scheduler. However, the per-user section under {{Active 
Users Info}} displays the value for the whole queue instead of per user. This 
is a problem in the GUI only.

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-10-02 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne reassigned YARN-7245:


Assignee: Eric Payne

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-09-29 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-4163:
-
Attachment: YARN-4163.004.patch

I'm uploading YARN-4163.004.patch to upmerge the patch to trunk. Also, this 
patch addresses [~jlowe]'s review comments.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.004.patch, YARN-4163.2.patch, 
> YARN-4163.2.patch, YARN-4163.3.patch, YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause load spikes, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes adding them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7084) TestSchedulingMonitor#testRMStarts fails sporadically

2017-09-29 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185991#comment-16185991
 ] 

Eric Payne commented on YARN-7084:
--

Thanks [~jlowe] for reporting the issue and the fix. The strategy and fix LGTM
+1

Will commit soon.

> TestSchedulingMonitor#testRMStarts fails sporadically
> -
>
> Key: YARN-7084
> URL: https://issues.apache.org/jira/browse/YARN-7084
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>Reporter: Jason Lowe
>Assignee: Jason Lowe
> Attachments: YARN-7084.001.patch
>
>
> TestSchedulingMonitor has been failing sporadically in precommit builds.  
> Failures look like this:
> {noformat}
> Running 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.802 sec <<< 
> FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
> testRMStarts(org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor)
>   Time elapsed: 1.728 sec  <<< FAILURE!
> org.mockito.exceptions.verification.WantedButNotInvoked: 
> Wanted but not invoked:
> schedulingEditPolicy.editSchedule();
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)
> However, there were other interactions with this mock:
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.<init>(SchedulingMonitor.java:50)
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:61)
> -> at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:62)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-26 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7249:
-
Fix Version/s: 2.8.2

OK. I cherry-picked this to 2.8.2. Thanks.

> Fix CapacityScheduler NPE issue when a container preempted while the node is 
> being removed
> --
>
> Key: YARN-7249
> URL: https://issues.apache.org/jira/browse/YARN-7249
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Fix For: 2.8.2, 2.8.3
>
> Attachments: YARN-7249.branch-2.8.001.patch
>
>
> This issue could happen when 3 conditions are satisfied:
> 1) A node is being removed from the scheduler.
> 2) A container running on the node is being preempted. 
> 3) A rare race condition causes the scheduler to pass a null node to the leaf 
> queue.
> The fix is to add a null node check inside CapacityScheduler.
> Stack trace:
> {code}
> 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(714)) - Error in handling event type 
> KILL_RESERVED_CONTAINER to the scheduler 
> java.lang.NullPointerException 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
>  
> {code}
> This is an issue that only exists in 2.8.x



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-26 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16180806#comment-16180806
 ] 

Eric Payne commented on YARN-7249:
--

{quote}
I think it should be fine: containers are properly released when 
CapacityScheduler#removeNode is called. And if parallel threads access the 
scheduler and queue#completedContainer gets invoked with a non-null but already 
removed node, it becomes a no-op. Please let me know if you think differently.
{quote}
Makes sense [~leftnoteasy]. Thanks.

+1. Will commit later today.

> Fix CapacityScheduler NPE issue when a container preempted while the node is 
> being removed
> --
>
> Key: YARN-7249
> URL: https://issues.apache.org/jira/browse/YARN-7249
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-7249.branch-2.8.001.patch
>
>
> This issue could happen when 3 conditions are satisfied:
> 1) A node is being removed from the scheduler.
> 2) A container running on the node is being preempted. 
> 3) A rare race condition causes the scheduler to pass a null node to the leaf 
> queue.
> The fix is to add a null node check inside CapacityScheduler.
> Stack trace:
> {code}
> 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(714)) - Error in handling event type 
> KILL_RESERVED_CONTAINER to the scheduler 
> java.lang.NullPointerException 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
>  
> {code}
> This issue only exists in 2.8.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179653#comment-16179653
 ] 

Eric Payne commented on YARN-7249:
--

[~leftnoteasy], I recognize that calling {{queue.completedContainer}} in 
{{CapacityScheduler#completedContainerInternal}} doesn't make sense if {{node}} 
is null, but if {{queue.completedContainer}} isn't called, won't that leave 
references to the container still inside internal structures? And, for example, 
won't the reserved-container counters be left un-decremented?
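
For illustration, this is roughly the kind of guard being discussed; a minimal sketch with hypothetical names, not the attached patch:
{code:title=Hypothetical sketch of a null-node guard (illustrative only)}
// Simplified shape of CapacityScheduler#completedContainerInternal with a guard;
// method and variable names here are illustrative, not the real code.
FiCaSchedulerNode node = getNode(rmContainer.getAllocatedNode());
if (node == null) {
  // The node has already been removed; removeNode() already released its
  // containers, so skipping the queue update avoids the NPE seen in the trace.
  LOG.info("Skipping completedContainer for " + rmContainer.getContainerId()
      + " because its node was already removed.");
  return;
}
queue.completedContainer(getClusterResource(), application, node, rmContainer,
    containerStatus, event, null, true);
{code}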

> Fix CapacityScheduler NPE issue when a container preempted while the node is 
> being removed
> --
>
> Key: YARN-7249
> URL: https://issues.apache.org/jira/browse/YARN-7249
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-7249.branch-2.8.001.patch
>
>
> This issue could happen when 3 conditions are satisfied:
> 1) A node is being removed from the scheduler.
> 2) A container running on the node is being preempted. 
> 3) A rare race condition causes the scheduler to pass a null node to the leaf queue.
> The fix is to add a null-node check inside CapacityScheduler.
> Stack trace:
> {code}
> 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(714)) - Error in handling event type 
> KILL_RESERVED_CONTAINER to the scheduler 
> java.lang.NullPointerException 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
>  
> {code}
> This issue only exists in 2.8.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7249) Fix CapacityScheduler NPE issue when a container preempted while the node is being removed

2017-09-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179553#comment-16179553
 ] 

Eric Payne commented on YARN-7249:
--

[~leftnoteasy]: Sure. Looking now.

> Fix CapacityScheduler NPE issue when a container preempted while the node is 
> being removed
> --
>
> Key: YARN-7249
> URL: https://issues.apache.org/jira/browse/YARN-7249
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.8.1
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: YARN-7249.branch-2.8.001.patch
>
>
> This issue could happen when 3 conditions are satisfied:
> 1) A node is being removed from the scheduler.
> 2) A container running on the node is being preempted. 
> 3) A rare race condition causes the scheduler to pass a null node to the leaf queue.
> The fix is to add a null-node check inside CapacityScheduler.
> Stack trace:
> {code}
> 2017-08-31 02:51:24,748 FATAL resourcemanager.ResourceManager 
> (ResourceManager.java:run(714)) - Error in handling event type 
> KILL_RESERVED_CONTAINER to the scheduler 
> java.lang.NullPointerException 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1308)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1469)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:497)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.killReservedContainer(CapacityScheduler.java:1505)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1341)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:127)
>  
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:705)
>  
> {code}
> This issue only exists in 2.8.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4163) Audit getQueueInfo and getApplications calls

2017-09-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179253#comment-16179253
 ] 

Eric Payne commented on YARN-4163:
--

[~lichangleo], please let me know if you plan on up-merging the patch and 
addressing the above comments. If you need help, please let me know.

> Audit getQueueInfo and getApplications calls
> 
>
> Key: YARN-4163
> URL: https://issues.apache.org/jira/browse/YARN-4163
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Chang Li
>Assignee: Chang Li
> Attachments: YARN-4163.2.patch, YARN-4163.2.patch, YARN-4163.3.patch, 
> YARN-4163.patch
>
>
> getQueueInfo and getApplications sometimes seem to cause spikes in load, but we 
> are not able to confirm this because they are not audit logged. This patch 
> proposes to add them to the audit log.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-09-22 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7245:
-
Attachment: CapSched UI Showing Inaccurate Per User Max AM Resource.png

The value in the {{Max AM Resource}} column in the {{Active Users Info}} 
section of the Capacity scheduler UI contains the value for {{Max Application 
Master Resources}}, which is the max for the whole queue. It should be the 
{{Max Application Master Resources Per User}} value, which is the max AM 
resources that a single user can use.

See the attached screenshot.

> In Cap Sched UI, Max AM Resource column in Active Users Info section should 
> be per-user
> ---
>
> Key: YARN-7245
> URL: https://issues.apache.org/jira/browse/YARN-7245
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Eric Payne
> Attachments: CapSched UI Showing Inaccurate Per User Max AM 
> Resource.png
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7245) In Cap Sched UI, Max AM Resource column in Active Users Info section should be per-user

2017-09-22 Thread Eric Payne (JIRA)
Eric Payne created YARN-7245:


 Summary: In Cap Sched UI, Max AM Resource column in Active Users 
Info section should be per-user
 Key: YARN-7245
 URL: https://issues.apache.org/jira/browse/YARN-7245
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler, yarn
Affects Versions: 3.0.0-alpha4, 2.8.1, 2.9.0
Reporter: Eric Payne






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7116) CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM usage.

2017-09-22 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7116:
-
Fix Version/s: 2.8.3

> CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM 
> usage.
> -
>
> Key: YARN-7116
> URL: https://issues.apache.org/jira/browse/YARN-7116
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, webapp
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3
>
> Attachments: YARN-7116.001.patch
>
>
> On CapacityScheduler's web UI, the AM usage of different users belonging to the 
> same queue always shows the queue's AM usage. 
> The root cause is in CapacitySchedulerPage: 
> {code}
> tbody.tr().td(userInfo.getUsername())
> .td(userInfo.getUserResourceLimit().toString())
> .td(resourcesUsed.toString())
> .td(resourceUsages.getAMLimit().toString())
> .td(amUsed.toString())
> .td(Integer.toString(userInfo.getNumActiveApplications()))
> .td(Integer.toString(userInfo.getNumPendingApplications()))._();
> {code}
> Instead of amUsed.toString(), it should use userInfo.getAmUsed().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7116) CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM usage.

2017-09-22 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176700#comment-16176700
 ] 

Eric Payne commented on YARN-7116:
--

[~leftnoteasy], [~sunilg]
If there are no objections, I'll backport this to 2.8.

> CapacityScheduler Web UI: Queue's AM usage is always show on per-user's AM 
> usage.
> -
>
> Key: YARN-7116
> URL: https://issues.apache.org/jira/browse/YARN-7116
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, webapp
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha4
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-7116.001.patch
>
>
> On CapacityScheduler's web UI, the AM usage of different users belonging to the 
> same queue always shows the queue's AM usage. 
> The root cause is in CapacitySchedulerPage: 
> {code}
> tbody.tr().td(userInfo.getUsername())
> .td(userInfo.getUserResourceLimit().toString())
> .td(resourcesUsed.toString())
> .td(resourceUsages.getAMLimit().toString())
> .td(amUsed.toString())
> .td(Integer.toString(userInfo.getNumActiveApplications()))
> .td(Integer.toString(userInfo.getNumPendingApplications()))._();
> {code}
> Instead of amUsed.toString(), it should use userInfo.getAmUsed().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-18 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7149:
-
Fix Version/s: 3.1.0
   2.9.0

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.9.0, 3.0.0-beta1, 3.1.0
>
> Attachments: YARN-7149.001.patch, YARN-7149.002.patch, 
> YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-18 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170157#comment-16170157
 ] 

Eric Payne commented on YARN-7149:
--

Thanks a lot [~leftnoteasy].

Also, this needs to be pulled back into branch-2. I will do that if there are 
no objections.

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 3.0.0-beta1
>
> Attachments: YARN-7149.001.patch, YARN-7149.002.patch, 
> YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-15 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168499#comment-16168499
 ] 

Eric Payne commented on YARN-7149:
--

The following unit tests are succeeding for me in my environment:
{code}
TestOpportunisticContainerAllocatorAMService
TestZKRMStateStore
TestSubmitApplicationWithRMHA 
{code}

{{TestContainerAllocation}} was modified by this patch, and the new test is 
succeeding. The failure in 
{{TestContainerAllocation#testAMContainerAllocationWhenDNSUnavailable}} is a 
pre-existing issue: YARN-7044.

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.001.patch, YARN-7149.002.patch, 
> YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-15 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7149:
-
Attachment: YARN-7149.002.patch

bq. Do you think does it make sense to merge the 
{{YARN-7149.demo.unit-test.patch}} to your patch?
Thanks [~leftnoteasy]. I spent some time looking through the test patch to make 
sure I understand its purpose. I think it makes sense to merge it with this 
change. Attaching an updated patch.

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.001.patch, YARN-7149.002.patch, 
> YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-14 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166286#comment-16166286
 ] 

Eric Payne commented on YARN-7149:
--

Unit test failures are not related to this patch:
{{TestAbstractYarnScheduler}}: Succeeds for me locally
{{TestContainerAllocation}}: YARN-7044

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.001.patch, YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-13 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7149:
-
Attachment: YARN-7149.001.patch

Rather than use this JIRA to revert the {{computeUserLimit}} behavior to 
pre-YARN-5889, patch {{YARN-7149.001.patch}} just adds {{minimumAllocation (min 
container size)}} to {{resourceUsed}}. I see this as a compromise between the 
old and the new behavior. Please let me know your thoughts.
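
In the rough pseudocode style used elsewhere on this JIRA, the compromise amounts to something like the following (illustrative only, not the patch itself):
{code:title=Rough pseudocode of the YARN-7149.001 idea}
userLimit = ((queueResourcesUsedByActiveUsers + minContainerSize) / #activeUsers)
{code}
This keeps the new per-active-user calculation but pads it by one minimum container, so the computed limit stays strictly above a user's current usage and pending demand is not hidden.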

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.001.patch, YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4727) Unable to override the $HADOOP_CONF_DIR env variable for container

2017-09-13 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165194#comment-16165194
 ] 

Eric Payne commented on YARN-4727:
--

+1 Thanks [~jlowe]

> Unable to override the $HADOOP_CONF_DIR env variable for container
> --
>
> Key: YARN-4727
> URL: https://issues.apache.org/jira/browse/YARN-4727
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.4.1, 2.5.2, 2.7.2, 2.6.4, 2.8.1
>Reporter: Terence Yim
>Assignee: Jason Lowe
> Attachments: YARN-4727.001.patch, YARN-4727.002.patch
>
>
> Given the default config of "yarn.nodemanager.env-whitelist", application 
> should be able to set the env variable $HADOOP_CONF_DIR to value other than 
> the one in the NodeManager system environment. However, I believe due to a 
> bug in the 
> {{org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch}}
>  class, it is not possible to do so.
> From the {{sanitizeEnv()}} method in the ContainerLaunch class 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java#L977)
> {noformat}
> putEnvIfNotNull(environment, 
> Environment.HADOOP_CONF_DIR.name(), 
> System.getenv(Environment.HADOOP_CONF_DIR.name())
> );
> if (!Shell.WINDOWS) {
>   environment.put("JVM_PID", "$$");
> }
> String[] whitelist = conf.get(YarnConfiguration.NM_ENV_WHITELIST, 
> YarnConfiguration.DEFAULT_NM_ENV_WHITELIST).split(",");
> 
> for(String whitelistEnvVariable : whitelist) {
>   putEnvIfAbsent(environment, whitelistEnvVariable.trim());
> }
> ...
> private static void putEnvIfAbsent(
> Map environment, String variable) {
>   if (environment.get(variable) == null) {
> putEnvIfNotNull(environment, variable, System.getenv(variable));
>   }
> }
> {noformat}
> So there are two issues here.
> 1. The environment is already set with the NM's system environment by the 
> {{putEnvIfNotNull}} call, hence the {{putEnvIfAbsent}} call will never 
> set it to a new value.
> 2. Inside the {{putEnvIfAbsent}} call, it uses the system environment of the 
> NM, when it should be using the one from the {{launchContext}} instead.
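
For illustration only (a hypothetical sketch, not YARN-4727.002.patch), one way to address issue 2 would be to consult the container's own launch-context environment before falling back to the NM's:
{code:title=Hypothetical sketch, not the attached patch}
// Illustrative only: prefer the app-supplied value from the launch context,
// and only fall back to the NodeManager's own environment if it is absent.
private static void putEnvIfAbsent(Map<String, String> environment,
    Map<String, String> launchContextEnv, String variable) {
  if (environment.get(variable) == null) {
    String value = launchContextEnv.get(variable);
    if (value == null) {
      value = System.getenv(variable);
    }
    putEnvIfNotNull(environment, variable, value);
  }
}
{code}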



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-11 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161546#comment-16161546
 ] 

Eric Payne commented on YARN-7149:
--

bq. You could check the unit test code to see if that matches your expectation.
I see that the patch for YARN-5889 needed to change the expected headroom values in 
{{TestLeafQueue}} for assertions in {{testComputeUserLimitAndSetHeadroom}} and 
{{testHeadroomWithMaxCap}}:
{code}
@@ -1123,9 +1129,9 @@ public void testComputeUserLimitAndSetHeadroom() throws 
IOException {
 //testcase3 still active - 2+2+6=10
 assertEquals(10*GB, qb.getUsedResources().getMemorySize());
 //app4 is user 0
-//maxqueue 16G, userlimit 13G, used 8G, headroom 5G
+//maxqueue 16G, userlimit 7G, used 8G, headroom 5G
 //(8G used is 6G from this test case - app4, 2 from last test case, app_1)
-assertEquals(5*GB, app_4.getHeadroom().getMemorySize());
+assertEquals(0*GB, app_4.getHeadroom().getMemorySize());
   }

   @Test
@@ -1309,8 +1315,8 @@ public void testHeadroomWithMaxCap() throws Exception {
 assertEquals(2*GB, app_0.getCurrentConsumption().getMemorySize());
 assertEquals(0*GB, app_1.getCurrentConsumption().getMemorySize());
 // TODO, fix headroom in the future patch
-assertEquals(1*GB, app_0.getHeadroom().getMemorySize());
-  // User limit = 4G, 2 in use
+assertEquals(0*GB, app_0.getHeadroom().getMemorySize());
+  // User limit = 2G, 2 in use
 assertEquals(0*GB, app_1.getHeadroom().getMemorySize());
   // the application is not yet active

@@ -1322,15 +1328,15 @@ public void testHeadroomWithMaxCap() throws Exception {
 assertEquals(3*GB, a.getUsedResources().getMemorySize());
 assertEquals(2*GB, app_0.getCurrentConsumption().getMemorySize());
 assertEquals(1*GB, app_1.getCurrentConsumption().getMemorySize());
-assertEquals(1*GB, app_0.getHeadroom().getMemorySize()); // 4G - 3G
-assertEquals(1*GB, app_1.getHeadroom().getMemorySize()); // 4G - 3G
+assertEquals(0*GB, app_0.getHeadroom().getMemorySize()); // 4G - 3G
+assertEquals(0*GB, app_1.getHeadroom().getMemorySize()); // 4G - 3G

 // Submit requests for app_1 and set max-cap
 a.setMaxCapacity(.1f);
 app_2.updateResourceRequests(Collections.singletonList(
{code} 

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6248) user is not removed from UsersManager’s when app is killed with pending container requests.

2017-09-07 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-6248:
-
Fix Version/s: 2.9.0

> user is not removed from UsersManager’s when app is killed with pending 
> container requests.
> ---
>
> Key: YARN-6248
> URL: https://issues.apache.org/jira/browse/YARN-6248
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 2.9.0, 3.0.0-alpha4
>
> Attachments: User Left Over.jpg, YARN-6248.001.patch
>
>
> If an app is still asking for resources when it is killed, the user is left 
> in the UsersManager structure and shows up on the GUI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6248) user is not removed from UsersManager’s when app is killed with pending container requests.

2017-09-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157522#comment-16157522
 ] 

Eric Payne commented on YARN-6248:
--

I'm seeing this in branch-2 (2.9.0) as well. I will backport.

> user is not removed from UsersManager’s when app is killed with pending 
> container requests.
> ---
>
> Key: YARN-6248
> URL: https://issues.apache.org/jira/browse/YARN-6248
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha4
>Reporter: Eric Payne
>Assignee: Eric Payne
> Fix For: 3.0.0-alpha4
>
> Attachments: User Left Over.jpg, YARN-6248.001.patch
>
>
> If an app is still asking for resources when it is killed, the user is left 
> in the UsersManager structure and shows up on the GUI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157471#comment-16157471
 ] 

Eric Payne commented on YARN-7149:
--

bq. Yes you're correct, The max op is consistent with old behavior, we don't 
need to change it to min.
[~leftnoteasy], Sorry, but I'm still confused about what behavior is desired. 
IMHO, the old behavior was more consistent with the expectations of the MULP in 
a capacity scheduler. That is, the first users with asking apps are elevated to 
their user limit as quickly as possible in a FIFO order. So, the thing I'm 
confused about is what the use case would be for raising all asking users more 
evenly in a capacity scheduler context. It seems to me that the latter could 
sometimes prevent any user from achieving its user limit. Thanks!


> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-07 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157263#comment-16157263
 ] 

Eric Payne commented on YARN-7149:
--

Thanks very much for your insights [~leftnoteasy].

bq. When we have two active users in the queue, and userLimit set to 100, first 
user will always get preferred until queue reaches maxCapacity.
I assume {{userLimit}} means {{minimum-user-limit-percent}}, correct? If so, 
then shouldn't the above statement be "first user will always get preferred 
until queue reaches {{Capacity * user-limit-factor}}"? If my assumptions are 
correct, then I think this is exactly the behavior we want. If a queue has a 
MULP of 100%, then by definition only the user with the first active app gets 
resources. Can you please elaborate on this use case?

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
> Attachments: YARN-7149.demo.unit-test.patch
>
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-06 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155872#comment-16155872
 ] 

Eric Payne commented on YARN-7149:
--

bq. I would like to still pursue Jason Lowe's suggestion about reverting to the 
pre-YARN-5889 behavior.
I think I should clarify this. I think we are specifically talking about 
reverting the piece of code in {{computeUserLimit}} that boils (way) down to 
(roughly):
{code:title=OLD}
userLimit = (queueAllUsedResources < queueGuaranteedResources) ?
(queueGuaranteedResources / #activeUsers) :
((queueAllUsedResources + minContainerSize) / #activeUsers)
{code}
{code:title=NEW}
userLimit = (queueResourcesUsedByActiveUsers / #activeUsers)
{code}
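
A rough worked example, using the numbers from this JIRA's use case and the simplified formulas above (illustrative arithmetic, not exact scheduler output):
{noformat}
Q2: guaranteed = 10 GB, one active user (User2) using 1 GB after the first
    preemption round, 9 GB still pending, minContainerSize = 0.5 GB

OLD: 1 GB used < 10 GB guaranteed, so userLimit = 10 GB / 1 = 10 GB
     headroom = 10 GB - 1 GB = 9 GB  -> pending demand stays visible, preemption continues

NEW: userLimit = 1 GB (used by active users) / 1 = 1 GB
     headroom = 1 GB - 1 GB = 0      -> pending demand is hidden, preemption stops
{noformat}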

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-06 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155865#comment-16155865
 ] 

Eric Payne commented on YARN-7149:
--

Thanks for your insights, [~sunilg].

bq. Here I think we have to use {{getResourceLimitForAllUsers}} instead of 
{{getResourceLimitForActiveUsers}}
No, I don't think so. 
{{LeafQueue#getTotalPendingResourcesConsideringUserLimit}} is called during 
pre-preemption when attempting to calculate the total amount of resources that 
would be assigned to a queue if resources were to suddenly free up (by 
preemption). When these resources are freed up, the scheduler will use 
{{getResourceLimitForActiveUsers}} when deciding the amount of these resources 
to give to the queue. They should both use {{getResourceLimitForActiveUsers}}, 
which calculates the user limit for active users only.

I would like to still pursue [~jlowe]'s suggestion about reverting to the 
pre-YARN-5889 behavior.

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-05 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153529#comment-16153529
 ] 

Eric Payne commented on YARN-7149:
--

[~sunilg], [~leftnoteasy], and [~jlowe], I would be interested in your thoughts 
on this JIRA. Thanks!

> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-01 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151071#comment-16151071
 ] 

Eric Payne commented on YARN-7149:
--

I believe that this is related to the refactoring done to the way User Limit is 
calculated in trunk / branch-2. I cannot reproduce the above use case prior to 
YARN-5889.

I think the reason is that prior to YARN-5889 (and in 2.8), the result of 
{{LeafQueue#computeUserLimit}} was always greater than the amount used by any 
given user. After YARN-5889, the return value of 
{{UsersManager#computeUserLimit}} can be equal to the amount used by any given 
user. Then, in {{LeafQueue#getTotalPendingResourcesConsideringUserLimit}}, we 
can have a situation where {{userLimit}} - {{user.getUsed}} always equals 0, 
even when it shouldn't.
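
A stripped-down sketch of where that subtraction bites; names are simplified and this is not the actual {{LeafQueue}} code:
{code:title=Simplified sketch of the pending-resource calculation (illustrative only)}
// Rough shape of LeafQueue#getTotalPendingResourcesConsideringUserLimit.
Resource totalPending = Resources.none();
for (User user : usersWithPendingApps) {
  Resource userLimit = computeUserLimit(user);
  // Post-YARN-5889, userLimit can equal user.getUsed(), making headroom 0.
  Resource headroom = Resources.subtract(userLimit, user.getUsed());
  // Pending demand is capped by headroom, so a 0 headroom hides every pending
  // ask from the preemption monitor.
  totalPending = Resources.add(totalPending,
      Resources.min(rc, clusterResource, user.getPending(), headroom));
}
{code}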


> Cross-queue preemption sometimes starves an underserved queue
> -
>
> Key: YARN-7149
> URL: https://issues.apache.org/jira/browse/YARN-7149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.9.0, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>
> In branch 2 and trunk, I am consistently seeing some use cases where 
> cross-queue preemption does not happen when it should. I do not see this in 
> branch-2.8.
> Use Case:
> | | *Size* | *Minimum Container Size* |
> |MyCluster | 20 GB | 0.5 GB |
> | *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit 
> Percent (MULP)* | *User Limit Factor (ULF)* |
> |Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> |Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
> - {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
> - {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
> - _Note: containers are 0.5 GB._
> - Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
> - Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
> - _No more containers are ever preempted, even though {{Q2}} is far 
> underserved_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7149) Cross-queue preemption sometimes starves an underserved queue

2017-09-01 Thread Eric Payne (JIRA)
Eric Payne created YARN-7149:


 Summary: Cross-queue preemption sometimes starves an underserved 
queue
 Key: YARN-7149
 URL: https://issues.apache.org/jira/browse/YARN-7149
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 3.0.0-alpha3, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne


In branch 2 and trunk, I am consistently seeing some use cases where 
cross-queue preemption does not happen when it should. I do not see this in 
branch-2.8.

Use Case:
| | *Size* | *Minimum Container Size* |
|MyCluster | 20 GB | 0.5 GB |

| *Queue Name* | *Capacity* | *Absolute Capacity* | *Minimum User Limit Percent 
(MULP)* | *User Limit Factor (ULF)* |
|Q1 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |
|Q2 | 50% = 10 GB | 100% = 20 GB | 10% = 1 GB | 2.0 |

- {{User1}} launches {{App1}} in {{Q1}} and consumes all resources (20 GB)
- {{User2}} launches {{App2}} in {{Q2}} and requests 10 GB
- _Note: containers are 0.5 GB._
- Preemption monitor kills 2 containers (equals 1 GB) from {{App1}} in {{Q1}}.
- Capacity Scheduler assigns 2 containers (equals 1 GB) to {{App2}} in {{Q2}}.
- _No more containers are ever preempted, even though {{Q2}} is far underserved_




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section

2017-08-29 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7120:
-
Attachment: YARN-7120.001.patch

> CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
> -
>
> Key: YARN-7120
> URL: https://issues.apache.org/jira/browse/YARN-7120
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: Aggregate Scheduler Counts All Sections.png, Aggregate 
> Scheduler Counts Bottom Cut Off.png, YARN-7120.001.patch
>
>
> The problem manifests itself by having the bottom part of the "Aggregated 
> scheduler counts" section cut off on the GUI and an NPE in the RM log.
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
>   at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>   at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>   at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>   at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>   at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>   at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
>   ... 58 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section

2017-08-29 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7120:
-
Attachment: Aggregate Scheduler Counts All Sections.png
Aggregate Scheduler Counts Bottom Cut Off.png

> CapacitySchedulerPage NPE in "Aggregate scheduler counts" section
> -
>
> Key: YARN-7120
> URL: https://issues.apache.org/jira/browse/YARN-7120
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Minor
> Attachments: Aggregate Scheduler Counts All Sections.png, Aggregate 
> Scheduler Counts Bottom Cut Off.png
>
>
> The problem manifests itself by having the bottom part of the "Aggregated 
> scheduler counts" section cut off on the GUI and an NPE in the RM log.
> {noformat}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
>   at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>   at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>   at 
> org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>   at 
> org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>   at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>   at 
> org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>   at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>   at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
>   ... 58 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7120) CapacitySchedulerPage NPE in "Aggregate scheduler counts" section

2017-08-29 Thread Eric Payne (JIRA)
Eric Payne created YARN-7120:


 Summary: CapacitySchedulerPage NPE in "Aggregate scheduler counts" 
section
 Key: YARN-7120
 URL: https://issues.apache.org/jira/browse/YARN-7120
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha3, 2.8.1, 2.9.0
Reporter: Eric Payne
Assignee: Eric Payne
Priority: Minor


The problem manifests itself by having the bottom part of the "Aggregated 
scheduler counts" section cut off on the GUI and an NPE in the RM log.
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$HealthBlock.render(CapacitySchedulerPage.java:558)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock$Block.subView(HtmlBlock.java:43)
at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet.__(Hamlet.java:30354)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.CapacitySchedulerPage$QueuesBlock.render(CapacitySchedulerPage.java:478)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
at 
org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
at 
org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
at 
org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
at 
org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.scheduler(RmController.java:86)
... 58 more
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7051) Avoid concurrent modification exception in FifoIntraQueuePreemptionPlugin

2017-08-28 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143830#comment-16143830
 ] 

Eric Payne commented on YARN-7051:
--

Thanks [~sunilg] for your help in resolving this problem.
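
For the record, the failure mode is easy to reproduce with plain JDK types 
(sketch below, not the scheduler classes): an unmodifiable wrapper only blocks 
writes through the wrapper, so the backing collection can still change 
underneath a fail-fast iterator.
{code:title=Sketch of the failure mode (JDK types only, not the scheduler code)}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

public class UnmodifiableViewSketch {
  public static void main(String[] args) {
    List<String> apps = new ArrayList<String>(
        Arrays.asList("app1", "app2", "app3"));
    // Read-only *view*: writes through the wrapper are blocked, but the
    // backing list can still change while someone iterates the view.
    Collection<String> running = Collections.unmodifiableCollection(apps);

    try {
      for (String app : running) {
        if (app.equals("app1")) {
          // Simulates the scheduler finishing an app while the preemption
          // monitor is still walking its "read-only" view of running apps.
          apps.remove("app1");
        }
      }
    } catch (ConcurrentModificationException e) {
      System.out.println("fail-fast iterator tripped: " + e);
    }
  }
}
{code}
This is only meant to illustrate why "unmodifiable" does not mean "safe to 
iterate while it changes", not what the committed patch does.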

> Avoid concurrent modification exception in FifoIntraQueuePreemptionPlugin
> -
>
> Key: YARN-7051
> URL: https://issues.apache.org/jira/browse/YARN-7051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
> Attachments: YARN-7051.001.patch, YARN-7051.002.patch
>
>
> {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the 
> following code:
> {code}
> Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
> Resource amUsed = Resources.createResource(0, 0);
> for (FiCaSchedulerApp app : runningApps) {
> {code}
> {{runningApps}} is unmodifiable but not concurrent. This caused the 
> preemption monitor thread to crash in the RM in one of our clusters.






[jira] [Commented] (YARN-7087) NM failed to perform log aggregation due to absent container

2017-08-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142140#comment-16142140
 ] 

Eric Payne commented on YARN-7087:
--

[~jlowe], thanks for finding, reporting, and fixing this issue.

+1. The patch LGTM.

I will commit this afternoon.

> NM failed to perform log aggregation due to absent container
> 
>
> Key: YARN-7087
> URL: https://issues.apache.org/jira/browse/YARN-7087
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.8.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-7087.001.patch, YARN-7087.002.patch
>
>
> Saw a case where the NM failed to aggregate the logs for a container because 
> it claimed it was absent:
> {noformat}
> 2017-08-23 18:35:38,283 [AsyncDispatcher event handler] WARN 
> logaggregation.LogAggregationService: Log aggregation cannot be started for 
> container_e07_1503326514161_502342_01_01, as its an absent container
> {noformat}
> Containers should not be allowed to disappear if they're not done being fully 
> processed by the NM.






[jira] [Commented] (YARN-7052) RM SchedulingMonitor gives no indication why the spawned thread crashed.

2017-08-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141746#comment-16141746
 ] 

Eric Payne commented on YARN-7052:
--

The following unit tests are all passing for me in my environment:
{noformat}
  org.apache.hadoop.yarn.server.resourcemanager.TestRMStoreCommands
  org.apache.hadoop.yarn.server.resourcemanager.TestSubmitApplicationWithRMHA
  org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
  org.apache.hadoop.yarn.server.resourcemanager.TestRMHAForNodeLabels 
{noformat}

The {{TestContainerAllocation}} unit test failure is the same one tracked in 
YARN-7044.

> RM SchedulingMonitor gives no indication why the spawned thread crashed.
> 
>
> Key: YARN-7052
> URL: https://issues.apache.org/jira/browse/YARN-7052
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-7052.001.patch
>
>
> In YARN-7051, we ran into a case where the preemption monitor thread hung 
> with no indication of why.
> The preemption monitor is started by the {{SchedulingExecutorService}} from 
> {{SchedulingMonitor#serviceStart}}. Once an uncaught throwable happens, 
> nothing ever gets the result of the future, the thread running the preemption 
> monitor never dies, and it never gets rescheduled.
> If {{HadoopExecutor}} were used, it would at least provide a 
> {{HadoopScheduledThreadPoolExecutor}} that logs the exception if one happens.
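
The behaviour being described is easy to reproduce with plain JDK classes 
(sketch only, not the {{SchedulingMonitor}} code): when a periodic task 
throws, {{scheduleAtFixedRate}} stops rescheduling it and the throwable is 
only available through the returned future, which nothing reads.
{code:title=Sketch of the silent-crash behaviour (JDK only, not SchedulingMonitor)}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SilentCrashSketch {
  public static void main(String[] args) throws Exception {
    ScheduledExecutorService ses = Executors.newScheduledThreadPool(2);

    // Crashes on its third run; no further runs happen and nothing is
    // logged because the throwable is trapped in the (unread) future.
    final AtomicInteger bare = new AtomicInteger();
    ses.scheduleAtFixedRate(new Runnable() {
      public void run() {
        int n = bare.incrementAndGet();
        System.out.println("bare monitor run " + n);
        if (n == 3) {
          throw new IllegalStateException("boom");
        }
      }
    }, 0, 100, TimeUnit.MILLISECONDS);

    // Defensive variant: catch and log inside the task so an iteration
    // failure is at least visible and does not stop future runs.
    final AtomicInteger wrapped = new AtomicInteger();
    ses.scheduleAtFixedRate(new Runnable() {
      public void run() {
        try {
          int n = wrapped.incrementAndGet();
          System.out.println("wrapped monitor run " + n);
          if (n == 3) {
            throw new IllegalStateException("boom");
          }
        } catch (Throwable t) {
          System.err.println("monitor iteration failed: " + t);
        }
      }
    }, 0, 100, TimeUnit.MILLISECONDS);

    Thread.sleep(700);
    ses.shutdownNow();
  }
}
{code}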






[jira] [Commented] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception

2017-08-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141684#comment-16141684
 ] 

Eric Payne commented on YARN-7051:
--

Unit test failures are not related to this patch.
- {{TestContainerAllocation}}: YARN-7044
- {{h.y.s.r.security.TestDelegationTokenRenewer}}: YARN-5816

{code:title=LeafQueue#getAllApplications}
  public Collection<FiCaSchedulerApp> getAllApplications() {
    Collection<FiCaSchedulerApp> apps = new HashSet<FiCaSchedulerApp>(
        pendingOrderingPolicy.getSchedulableEntities());
    apps.addAll(orderingPolicy.getSchedulableEntities());

    return Collections.unmodifiableCollection(apps);
  }
{code}
bq. {{getAllApplications}} in {{LeafQueue}} then has to be under readlock also, 
correct?
Possibly. It looks like {{HashSet#addAll}} will iterate through 
{{orderingPolicy}}, which could change during the loop. However, I would like 
to have that discussion on a separate JIRA, since I may be misinterpreting how 
{{addAll}} works and since the usage of {{getAllApplications}} affects more 
than just preemption.
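
If we do take it to a separate JIRA, the shape of the change would presumably 
be something like the sketch below, assuming the queue's existing {{readLock}} 
is available at this point; verifying that assumption is part of what the 
follow-up would need to do.
{code:title=Sketch only: copy under the queue's read lock}
  public Collection<FiCaSchedulerApp> getAllApplications() {
    readLock.lock();
    try {
      // Snapshot both ordering policies while holding the lock so neither
      // can change while the copy is being built.
      Collection<FiCaSchedulerApp> apps = new HashSet<FiCaSchedulerApp>(
          pendingOrderingPolicy.getSchedulableEntities());
      apps.addAll(orderingPolicy.getSchedulableEntities());
      return Collections.unmodifiableCollection(apps);
    } finally {
      readLock.unlock();
    }
  }
{code}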

> FifoIntraQueuePreemptionPlugin can get concurrent modification exception
> 
>
> Key: YARN-7051
> URL: https://issues.apache.org/jira/browse/YARN-7051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-7051.001.patch, YARN-7051.002.patch
>
>
> {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the 
> following code:
> {code}
> Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
> Resource amUsed = Resources.createResource(0, 0);
> for (FiCaSchedulerApp app : runningApps) {
> {code}
> {{runningApps}} is unmodifiable but not concurrent. This caused the 
> preemption monitor thread to crash in the RM in one of our clusters.






[jira] [Updated] (YARN-7051) FifoIntraQueuePreemptionPlugin can get concurrent modification exception

2017-08-24 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-7051:
-
Attachment: YARN-7051.002.patch

bq.  so this won't be changing while createTempAppForResCalculation is looping 
over the list.
However, I did find a race condition that throws an NPE within 
{{createTempAppForResCalculation}}.

{noformat}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.util.resource.Resources.clone(Resources.java:155)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.createTempAppForResCalculation(FifoIntraQueuePreemptionPlugin.java:403)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoIntraQueuePreemptionPlugin.computeAppsIdealAllocation(FifoIntraQueuePreemptionPlugin.java:140)
at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:283)
{noformat}

The reason for this is that {{perUserAMUsed}} was populated only from the 
running apps prior to calling {{createTempAppForResCalculation}}, but 
{{createTempAppForResCalculation}} loops through both running and pending 
apps. A user that has only pending apps therefore has no entry in 
{{perUserAMUsed}}, and cloning that missing (null) resource throws the NPE 
above.

Attaching new patch that addresses this.
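
To make the race concrete, the null-safe handling amounts to something like 
the snippet below (hypothetical fragment with assumed names such as 
{{app.getUser()}}; not the literal hunk from the patch):
{code:title=Hypothetical fragment, not the literal patch}
    // A user that only has pending apps was never added to perUserAMUsed,
    // so guard the lookup before cloning instead of assuming an entry exists.
    Resource userAMUsed = perUserAMUsed.get(app.getUser());
    Resource amUsedCopy = (userAMUsed == null)
        ? Resources.createResource(0, 0)
        : Resources.clone(userAMUsed);
{code}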

> FifoIntraQueuePreemptionPlugin can get concurrent modification exception
> 
>
> Key: YARN-7051
> URL: https://issues.apache.org/jira/browse/YARN-7051
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption, yarn
>Affects Versions: 2.9.0, 2.8.1, 3.0.0-alpha3
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Critical
> Attachments: YARN-7051.001.patch, YARN-7051.002.patch
>
>
> {{FifoIntraQueuePreemptionPlugin#calculateUsedAMResourcesPerQueue}} has the 
> following code:
> {code}
> Collection<FiCaSchedulerApp> runningApps = leafQueue.getApplications();
> Resource amUsed = Resources.createResource(0, 0);
> for (FiCaSchedulerApp app : runningApps) {
> {code}
> {{runningApps}} is unmodifiable but not concurrent. This caused the 
> preemption monitor thread to crash in the RM in one of our clusters.





