[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve =
    Resources.max(rc, clusterResource,
        Resources.subtract(capability, headRoom),
        currentResoureLimits.getAmountNeededUnreserve());

boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when 
{{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
needToUnreserve, which is the result of {{Resources#greaterThan}}, will be 
{{false}}. This is not reasonable, because the required resource does exceed the 
headroom and unreserving is needed.
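To illustrate the pitfall, here is a self-contained toy (plain Java; it is NOT the Hadoop {{Resources}}/{{DominantResourceCalculator}} code, and the class and method names are made up): once a resource type has zero capacity in the cluster resource, a dominant-share style comparison can report "not greater than none" even though one component is clearly positive, for example because the 0/0 share becomes NaN and every comparison against NaN is false.
{code:java}
// Toy illustration only -- NOT the actual Hadoop comparison code.
public class DominantShareToy {

  // Dominant share of a demand vector over the cluster vector: the maximum
  // per-type ratio. With gpu capacity 0, the gpu term is 0/0 = NaN and
  // Math.max propagates the NaN.
  static float dominantShare(float[] demand, float[] cluster) {
    float share = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < demand.length; i++) {
      share = Math.max(share, demand[i] / cluster[i]);
    }
    return share;
  }

  public static void main(String[] args) {
    float[] cluster = {100 * 1024, 100, 0};      // <100GB, 100 vcores, 0 gpu>
    float[] toUnreserve = {8 * 1024, -6, 0};     // <8GB, -6 vcores, 0 gpu>
    float[] none = {0, 0, 0};

    // NaN comparisons are always false, so the toy reports "not greater than
    // none" even though the memory component (8GB) is clearly positive.
    System.out.println(
        dominantShare(toUnreserve, cluster) > dominantShare(none, cluster)); // false
  }
}
{code}
The real calculator's internals differ, but the toy shows why a comparison that goes through per-type shares is fragile once a zero-capacity resource type is part of the cluster resource.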
After that, when the flow reaches the unreserve logic in 
RegularContainerAllocator#assignContainer, the unreserve step is skipped 
because shouldAllocOrReserveNewContainer is true (required containers > 
reserved containers) while needToUnreserve has been wrongly calculated as false:
{code:java}
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    // The unreserve step is wrongly skipped when
    // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
    // even though the required resource did exceed the headroom.
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ...
    }
  }
}
{code}

  was:
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve =
    Resources.max(rc, clusterResource,
        Resources.subtract(capability, headRoom),
        currentResoureLimits.getAmountNeededUnreserve());

boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when 
{{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
needToUnreserve, which is the result of {{Resources#greaterThan}}, will be 
{{false}}. This is not reasonable, because the required resource does exceed the 
headroom and unreserving is needed.
After that, when the flow reaches the unreserve logic in 
RegularContainerAllocator#assignContainer, the unreserve step is skipped 
because shouldAllocOrReserveNewContainer is true (required containers > 
reserved containers) while needToUnreserve has been wrongly calculated as false:
{code:java}
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ... // unreserve process can be wrongly skipped here!!!
    }
  }
}
{code}


> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch

[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve =
    Resources.max(rc, clusterResource,
        Resources.subtract(capability, headRoom),
        currentResoureLimits.getAmountNeededUnreserve());

boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when 
{{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
needToUnreserve, which is the result of {{Resources#greaterThan}}, will be 
{{false}}. This is not reasonable, because the required resource does exceed the 
headroom and unreserving is needed.
After that, when the flow reaches the unreserve logic in 
RegularContainerAllocator#assignContainer, the unreserve step is skipped 
because shouldAllocOrReserveNewContainer is true (required containers > 
reserved containers) while needToUnreserve has been wrongly calculated as false:
{code:java}
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ... // unreserve process can be wrongly skipped here!!!
    }
  }
}
{code}

  was:
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve =
    Resources.max(rc, clusterResource,
        Resources.subtract(capability, headRoom),
        currentResoureLimits.getAmountNeededUnreserve());

boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
For example, the value of resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> 
when {{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
needToUnreserve, which is the result of {{Resources#greaterThan}}, will be 
{{false}} when using DominantResourceCalculator. This is not reasonable, because 
the required resource does exceed the headroom and unreserving is needed.
After that, when the flow reaches the unreserve logic in 
RegularContainerAllocator#assignContainer, the unreserve step is skipped 
because shouldAllocOrReserveNewContainer is true (required containers > 
reserved containers) while needToUnreserve has been wrongly calculated as false:
{code:java}
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ... // unreserve process can be wrongly skipped here!!!
    }
  }
}
{code}


> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when cluster is almos

[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
// How much need to unreserve equals to:
// max(required - headroom, amountNeedUnreserve)
Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
Resource resourceNeedToUnReserve =
    Resources.max(rc, clusterResource,
        Resources.subtract(capability, headRoom),
        currentResoureLimits.getAmountNeededUnreserve());

boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
For example, the value of resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> 
when {{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
needToUnreserve, which is the result of {{Resources#greaterThan}}, will be 
{{false}} when using DominantResourceCalculator. This is not reasonable, because 
the required resource does exceed the headroom and unreserving is needed.
After that, when the flow reaches the unreserve logic in 
RegularContainerAllocator#assignContainer, the unreserve step is skipped 
because shouldAllocOrReserveNewContainer is true (required containers > 
reserved containers) while needToUnreserve has been wrongly calculated as false:
{code:java}
if (availableContainers > 0) {
  if (rmContainer == null && reservationsContinueLooking
      && node.getLabels().isEmpty()) {
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      ... // unreserve process can be wrongly skipped here!!!
    }
  }
}
{code}

  was:
We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
the scheduler kept allocating for an app but always failed to commit, which can 
block requests from other apps and leave part of the cluster resource unusable.

To reproduce this problem:
(1) use DominantResourceCalculator
(2) the cluster resource has an empty resource type, for example gpu=0
(3) the scheduler allocates a container for app1, which has reserved containers and 
whose queue limit or user limit is reached (used + required > limit).

Relevant code in RegularContainerAllocator#assignContainer:
{code:java}
boolean needToUnreserve =
    Resources.greaterThan(rc, clusterResource,
        resourceNeedToUnReserve, Resources.none());
{code}
The value of resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu>, and the 
result of {{Resources#greaterThan}} will be false when using DominantResourceCalculator.


> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted (93% used): 
> the scheduler kept allocating for an app but always failed to commit, which can 
> block requests from other apps and leave part of the cluster resource unusable.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example gpu=0
> (3) the scheduler allocates a container for app1, which has reserved containers and 
> whose queue limit or user limit is reached (used + required > limit).
> Relevant code in RegularContainerAllocator#assignContainer:
> {code:java}
> // How much need to unreserve equals to:
> // max(required - headroom, amountNeedUnreserve)
> Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
> Resource resourceNeedToUnReserve =
>     Resources.max(rc, clusterResource,
>         Resources.subtract(capability, headRoom),
>         currentResoureLimits.getAmountNeededUnreserve());
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> For example, value of resourceNeedToUnReserve can be <8GB, -6 cores, 0 gpu> 
> when {{headRoom=<0GB

[jira] [Commented] (YARN-8715) Make allocation tags in the placement spec optional for node-attributes

2018-09-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617089#comment-16617089
 ] 

Hudson commented on YARN-8715:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14975 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14975/])
YARN-8715. Make allocation tags in the placement spec optional for (sunilg: rev 
33d8327cffdc483b538aec3022fd8730b85babdb)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/test/java/org/apache/hadoop/yarn/api/resource/TestPlacementConstraintParser.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/util/constraint/PlacementConstraintParser.java


> Make allocation tags in the placement spec optional for node-attributes
> ---
>
> Key: YARN-8715
> URL: https://issues.apache.org/jira/browse/YARN-8715
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8715.001.patch
>
>
> YARN-7863 adds support for specifying constraints that target node-attributes, 
> including support in the distributed shell, but one still needs to specify 
> {{allocationTags=numOfContainers}} in the spec. We should make this optional, 
> as it is not required for node-attribute expressions.






[jira] [Commented] (YARN-8715) Make allocation tags in the placement spec optional for node-attributes

2018-09-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617071#comment-16617071
 ] 

Weiwei Yang commented on YARN-8715:
---

Thanks [~sunilg]!

> Make allocation tags in the placement spec optional for node-attributes
> ---
>
> Key: YARN-8715
> URL: https://issues.apache.org/jira/browse/YARN-8715
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8715.001.patch
>
>
> YARN-7863 adds support for specifying constraints that target node-attributes, 
> including support in the distributed shell, but one still needs to specify 
> {{allocationTags=numOfContainers}} in the spec. We should make this optional, 
> as it is not required for node-attribute expressions.






[jira] [Commented] (YARN-8759) Copy of "resource-types.xml" is not deleted if test fails, causes other test failures

2018-09-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617070#comment-16617070
 ] 

Sunil Govindan commented on YARN-8759:
--

Thanks [~bsteinbach]. Looks fine. [~maniraj...@gmail.com], if you are fine with 
this patch, I could commit it later tomorrow. Thanks.

> Copy of "resource-types.xml" is not deleted if test fails, causes other test 
> failures
> -
>
> Key: YARN-8759
> URL: https://issues.apache.org/jira/browse/YARN-8759
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Major
> Attachments: YARN-8759.001.patch, YARN-8759.002.patch, 
> YARN-8759.003.patch
>
>
> resource-types.xml is copied to the test machine in several tests, but it is 
> deleted only at the end of each test. If a test fails, the file will not 
> be deleted and other tests will fail because of the wrong configuration.
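> A minimal sketch of one standard way to guarantee the cleanup regardless of the 
> test outcome (illustrative only, not necessarily what the attached patches do; 
> the paths and class name are hypothetical): delete the copied file in a JUnit 
> {{@After}} method, which runs even when the test body throws.
> {code:java}
> // Illustrative sketch -- hypothetical paths/class name, not the actual patch.
> import java.io.File;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.nio.file.StandardCopyOption;
> import org.junit.After;
> import org.junit.Before;
>
> public class ResourceTypesTestBase {
>   // Hypothetical destination; the real tests derive it from the test classpath.
>   private final File dest = new File("target/test-classes/resource-types.xml");
>
>   @Before
>   public void copyResourceTypes() throws Exception {
>     Files.copy(Paths.get("src/test/resources/resource-types.xml"),
>         dest.toPath(), StandardCopyOption.REPLACE_EXISTING);
>   }
>
>   @After
>   public void deleteResourceTypes() {
>     // Runs whether the test passed or failed, so later tests never pick up
>     // a stale resource-types.xml.
>     dest.delete();
>   }
> }
> {code}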






[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2018-09-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617041#comment-16617041
 ] 

ASF GitHub Bot commented on YARN-1964:
--

Github user cricket007 commented on the issue:

https://github.com/apache/hadoop/pull/7
  
Should probably be closed?

Superseded by https://issues.apache.org/jira/browse/YARN-5388


> Create Docker analog of the LinuxContainerExecutor in YARN
> --
>
> Key: YARN-1964
> URL: https://issues.apache.org/jira/browse/YARN-1964
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.2.0
>Reporter: Arun C Murthy
>Assignee: Abin Shahab
>Priority: Major
>  Labels: Docker
> Fix For: 2.6.0
>
> Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch
>
>
> *This alpha feature has been deprecated in branch-2 and removed from trunk* 
> Please see https://issues.apache.org/jira/browse/YARN-5388
> Docker (https://www.docker.io/) is, increasingly, a very popular container 
> technology.
> In context of YARN, the support for Docker will provide a very elegant 
> solution to allow applications to *package* their software into a Docker 
> container (entire Linux file system incl. custom versions of perl, python 
> etc.) and use it as a blueprint to launch all their YARN containers with 
> requisite software environment. This provides both consistency (all YARN 
> containers will have the same software environment) and isolation (no 
> interference with whatever is installed on the physical machine).






[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN

2018-09-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617040#comment-16617040
 ] 

ASF GitHub Bot commented on YARN-1964:
--

Github user cricket007 commented on the issue:

https://github.com/apache/hadoop/pull/6
  
Should probably be closed?

Superseded by https://issues.apache.org/jira/browse/YARN-5388


> Create Docker analog of the LinuxContainerExecutor in YARN
> --
>
> Key: YARN-1964
> URL: https://issues.apache.org/jira/browse/YARN-1964
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.2.0
>Reporter: Arun C Murthy
>Assignee: Abin Shahab
>Priority: Major
>  Labels: Docker
> Fix For: 2.6.0
>
> Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, 
> yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, 
> yarn-1964-docker.patch, yarn-1964-docker.patch
>
>
> *This alpha feature has been deprecated in branch-2 and removed from trunk* 
> Please see https://issues.apache.org/jira/browse/YARN-5388
> Docker (https://www.docker.io/) is, increasingly, a very popular container 
> technology.
> In context of YARN, the support for Docker will provide a very elegant 
> solution to allow applications to *package* their software into a Docker 
> container (entire Linux file system incl. custom versions of perl, python 
> etc.) and use it as a blueprint to launch all their YARN containers with 
> requisite software environment. This provides both consistency (all YARN 
> containers will have the same software environment) and isolation (no 
> interference with whatever is installed on the physical machine).






[jira] [Updated] (YARN-8774) Memory leak when CapacityScheduler allocates from reserved container with non-default label

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8774:
---
Affects Version/s: 2.8.5
   3.2.0

> Memory leak when CapacityScheduler allocates from reserved container with 
> non-default label
> ---
>
> Key: YARN-8774
> URL: https://issues.apache.org/jira/browse/YARN-8774
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0, 2.8.5
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8774.001.patch
>
>
> The cause is that the RMContainerImpl instance of the reserved container loses its 
> node label expression: when the scheduler reserves containers for non-default 
> node-label requests, the instance is wrongly added into 
> LeafQueue#ignorePartitionExclusivityRMContainers and never removed.
> To reproduce this memory leak:
> (1) Create the reserved container.
> RegularContainerAllocator#doAllocation: creates RMContainerImpl instanceA 
> (nodeLabelExpression="")
> LeafQueue#allocateResource: RMContainerImpl instanceA is put into 
> LeafQueue#ignorePartitionExclusivityRMContainers
> (2) Allocate from the reserved container.
> RegularContainerAllocator#doAllocation: creates RMContainerImpl instanceB 
> (nodeLabelExpression="test-label")
> (3) From now on, RMContainerImpl instanceA is left in memory (kept in 
> LeafQueue#ignorePartitionExclusivityRMContainers) until the RM is restarted.
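> A toy sketch of the bookkeeping mismatch described above (plain Java; this is 
> not the real LeafQueue code, and the map and keys are only illustrative): an 
> entry registered under the lost/empty label expression has no matching removal 
> under the real label, so it stays around forever.
> {code:java}
> // Toy illustration of the add/remove mismatch -- not the actual LeafQueue code.
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
>
> public class IgnorePartitionLeakToy {
>   public static void main(String[] args) {
>     Map<String, Set<String>> ignorePartitionContainers = new HashMap<>();
>
>     // (1) Reservation path: the reserved container lost its label expression,
>     //     so it is tracked under the empty partition key "".
>     ignorePartitionContainers.computeIfAbsent("", k -> new HashSet<>())
>         .add("instanceA");
>
>     // (2) Allocation from the reserved container works with the real label,
>     //     so any later cleanup that looks under "test-label" finds nothing.
>     Set<String> underRealLabel = ignorePartitionContainers.get("test-label");
>     if (underRealLabel != null) {
>       underRealLabel.remove("instanceA");   // never reached
>     }
>
>     // (3) instanceA is still tracked and stays until the RM restarts.
>     System.out.println(ignorePartitionContainers);   // prints {=[instanceA]}
>   }
> }
> {code}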






[jira] [Resolved] (YARN-8781) back-port YARN-8091 to branch-2.6.4

2018-09-16 Thread Zian Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zian Chen resolved YARN-8781.
-
Resolution: Invalid

> back-port YARN-8091 to branch-2.6.4
> ---
>
> Key: YARN-8781
> URL: https://issues.apache.org/jira/browse/YARN-8781
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Minor
> Fix For: 2.6.4
>
>
> We suggest a patch that back-ports the change 
> https://issues.apache.org/jira/browse/YARN-8091 to branch 2.6.4
>  
>  






[jira] [Commented] (YARN-8781) back-port YARN-8091 to branch-2.6.4

2018-09-16 Thread Zian Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616961#comment-16616961
 ] 

Zian Chen commented on YARN-8781:
-

Close as invalid.

> back-port YARN-8091 to branch-2.6.4
> ---
>
> Key: YARN-8781
> URL: https://issues.apache.org/jira/browse/YARN-8781
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.4
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Minor
> Fix For: 2.6.4
>
>
> We suggest a patch that back-ports the change 
> https://issues.apache.org/jira/browse/YARN-8091 to branch 2.6.4
>  
>  






[jira] [Commented] (YARN-8750) Refactor TestQueueMetrics

2018-09-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616957#comment-16616957
 ] 

Hadoop QA commented on YARN-8750:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 51s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
18s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 15m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 15m 
12s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
2m 43s{color} | {color:orange} root: The patch generated 8 new + 93 unchanged - 
23 fixed = 101 total (was 116) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git 
apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply 
{color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 37s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m 
27s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 32s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
42s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}178m 41s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy |
|   | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8750 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12939899/YARN-8750.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ed94782e969d 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven

[jira] [Commented] (YARN-8750) Refactor TestQueueMetrics

2018-09-16 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616894#comment-16616894
 ] 

Szilard Nemeth commented on YARN-8750:
--

Fixed the whitespace issues with patch002

> Refactor TestQueueMetrics
> -
>
> Key: YARN-8750
> URL: https://issues.apache.org/jira/browse/YARN-8750
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-8750.001.patch, YARN-8750.002.patch
>
>
> {{TestQueueMetrics#checkApps}} and {{TestQueueMetrics#checkResources}} have 8 
> and 14 parameters, respectively.
> It is very hard to read the test cases that use these methods.
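> One common way to tame such signatures (purely illustrative; not necessarily 
> the approach taken in the attached patches, and the field names and adapted 
> checkApps signature below are made up) is a small expected-values object with 
> a fluent builder, so call sites name each value instead of passing a long 
> positional list.
> {code:java}
> // Illustrative parameter-object sketch -- not the actual patch.
> final class AppMetricsExpectation {
>   int submitted, pending, running, completed;   // hypothetical fields
>
>   AppMetricsExpectation submitted(int v) { submitted = v; return this; }
>   AppMetricsExpectation pending(int v)   { pending = v;   return this; }
>   AppMetricsExpectation running(int v)   { running = v;   return this; }
>   AppMetricsExpectation completed(int v) { completed = v; return this; }
> }
>
> // A call site could then read as named values, e.g.:
> // checkApps(metrics, new AppMetricsExpectation()
> //     .submitted(1).running(1).pending(0).completed(0));
> {code}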






[jira] [Updated] (YARN-8750) Refactor TestQueueMetrics

2018-09-16 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-8750:
-
Attachment: YARN-8750.002.patch

> Refactor TestQueueMetrics
> -
>
> Key: YARN-8750
> URL: https://issues.apache.org/jira/browse/YARN-8750
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-8750.001.patch, YARN-8750.002.patch
>
>
> {{TestQueueMetrics#checkApps}} and {{TestQueueMetrics#checkResources}} have 8 
> and 14 parameters, respectively.
> It is very hard to read the test cases that use these methods.






[jira] [Commented] (YARN-8059) Resource type is ignored when FS decide to preempt

2018-09-16 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616890#comment-16616890
 ] 

Szilard Nemeth commented on YARN-8059:
--

The patch is ready for review!
The reason this is not in Patch Available status is that this issue depends 
on YARN-8750: until YARN-8750 is merged, this patch cannot be applied to trunk.

> Resource type is ignored when FS decide to preempt
> --
>
> Key: YARN-8059
> URL: https://issues.apache.org/jira/browse/YARN-8059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Yufei Gu
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8059.001.patch
>
>
> Method FairScheduler#shouldAttemptPreemption doesn't consider resources other 
> than vcores and memory. We may need to rethink it in the resource-types 
> scenario. cc [~miklos.szeg...@cloudera.com], [~wilfreds] and [~snemeth].
> {code}
> if (context.isPreemptionEnabled()) {
>   return (context.getPreemptionUtilizationThreshold() < Math.max(
>   (float) rootMetrics.getAllocatedMB() /
>   getClusterResource().getMemorySize(),
>   (float) rootMetrics.getAllocatedVirtualCores() /
>   getClusterResource().getVirtualCores()));
> }
> {code}
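> For comparison, a type-agnostic version of that check could look roughly like 
> the sketch below (plain maps instead of the real {{QueueMetrics}}/{{Resource}} 
> APIs; purely illustrative, not a proposed patch). It takes the maximum 
> utilization over every resource type and skips types with zero cluster capacity.
> {code:java}
> // Illustrative sketch only -- plain maps, not the real QueueMetrics/Resource APIs.
> import java.util.Map;
>
> public class PreemptionCheckToy {
>   static boolean shouldAttemptPreemption(float threshold,
>       Map<String, Long> allocated, Map<String, Long> clusterCapacity) {
>     float maxUtilization = 0f;
>     for (Map.Entry<String, Long> e : clusterCapacity.entrySet()) {
>       long capacity = e.getValue();
>       if (capacity <= 0) {
>         continue;   // skip empty resource types (e.g. gpu=0), avoid dividing by zero
>       }
>       long used = allocated.getOrDefault(e.getKey(), 0L);
>       maxUtilization = Math.max(maxUtilization, (float) used / capacity);
>     }
>     return threshold < maxUtilization;
>   }
> }
> {code}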






[jira] [Updated] (YARN-8059) Resource type is ignored when FS decide to preempt

2018-09-16 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-8059:
-
Attachment: YARN-8059.001.patch

> Resource type is ignored when FS decide to preempt
> --
>
> Key: YARN-8059
> URL: https://issues.apache.org/jira/browse/YARN-8059
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Yufei Gu
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8059.001.patch
>
>
> Method FairScheduler#shouldAttemptPreemption doesn't consider resources other 
> than vcores and memory. We may need to rethink it in the resource-types 
> scenario. cc [~miklos.szeg...@cloudera.com], [~wilfreds] and [~snemeth].
> {code}
> if (context.isPreemptionEnabled()) {
>   return (context.getPreemptionUtilizationThreshold() < Math.max(
>   (float) rootMetrics.getAllocatedMB() /
>   getClusterResource().getMemorySize(),
>   (float) rootMetrics.getAllocatedVirtualCores() /
>   getClusterResource().getVirtualCores()));
> }
> {code}


