[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-19 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8771:
--
Fix Version/s: (was: 3.0.4)

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8771.001.patch, YARN-8771.002.patch, 
> YARN-8771.003.patch, YARN-8771.004.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted (93%
> used): the scheduler kept allocating for an app but always failed to commit,
> which can block requests from other apps and leave part of the cluster's
> resources unusable.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example: gpu=0
> (3) the scheduler allocates a container for app1, which has reserved containers
> and whose queue limit or user limit has been reached (used + required > limit).
> Reference code in RegularContainerAllocator#assignContainer:
> {code:java}
> // How much need to unreserve equals to:
> // max(required - headroom, amountNeedUnreserve)
> Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
> Resource resourceNeedToUnReserve =
>     Resources.max(rc, clusterResource,
>         Resources.subtract(capability, headRoom),
>         currentResoureLimits.getAmountNeededUnreserve());
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when
> {{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}};
> needToUnreserve, the result of {{Resources#greaterThan}}, will then be
> {{false}}. This is not reasonable, because the required resource did exceed the
> headroom and unreserving is needed.
> After that, when reaching the unreserve process in
> RegularContainerAllocator#assignContainer, the unreserve process will be skipped
> when shouldAllocOrReserveNewContainer is true (that is, when required containers >
> reserved containers) and needToUnreserve has been wrongly calculated as false:
> {code:java}
> if (availableContainers > 0) {
>   if (rmContainer == null && reservationsContinueLooking
>       && node.getLabels().isEmpty()) {
>     // The unreserve process can be wrongly skipped when
>     // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
>     // even though the required resource did exceed the headroom.
>     if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
>       ...
>     }
>   }
> }
> {code}
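The failure of {{Resources#greaterThan}} described above can be illustrated outside Hadoop. The sketch below is a simplified, hypothetical model (the names `java_div`, `dominant_share`, and `greater_than` are illustrative, not Hadoop APIs) of how a dominant-share comparison can silently fail when the cluster value of a resource type is zero: under Java double semantics, 0.0/0.0 yields NaN, NaN propagates through Math.max, and every comparison against NaN is false.

```python
import math

def java_div(x, y):
    # Mimic Java double division: x/0.0 is +/-inf and 0.0/0.0 is NaN
    # (Python would raise ZeroDivisionError instead).
    if y != 0:
        return x / y
    return float("nan") if x == 0 else math.copysign(math.inf, x)

def dominant_share(res, cluster):
    # Largest per-type share; like Java's Math.max, NaN poisons the maximum.
    share = float("-inf")
    for r, c in zip(res, cluster):
        s = java_div(r, c)
        share = float("nan") if math.isnan(s) or math.isnan(share) else max(share, s)
    return share

def greater_than(lhs, rhs, cluster):
    # NaN > NaN is False, so an empty resource type in the cluster
    # (e.g. gpu=0) can make this comparison return False regardless of lhs.
    return dominant_share(lhs, cluster) > dominant_share(rhs, cluster)

cluster = (100 * 1024, 100, 0)  # <100GB, 100 vcores, 0 gpu>: gpu type is empty
need = (8 * 1024, -6, 0)        # resourceNeedToUnReserve from the example above
print(greater_than(need, (0, 0, 0), cluster))               # False: unreserve skipped
print(greater_than(need, (0, 0, 0), (100 * 1024, 100, 8)))  # True once gpu > 0
```

With a non-empty gpu type, the dominant share of `need` is 8192/102400 and the comparison behaves as expected; with gpu=0, both sides collapse to NaN. This matches the symptom described above, under the stated assumption about how per-type shares are computed.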



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.004.patch




[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-18 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.003.patch




[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8771:
-
Target Version/s: 3.1.1, 3.2.0




[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 

[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 

[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-16 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Description: 

[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.002.patch

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---
>
> Key: YARN-8771
> URL: https://issues.apache.org/jira/browse/YARN-8771
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.0
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted (93%
> used): the scheduler kept allocating for an app but always failed to commit,
> which can block requests from other apps and leave part of the cluster's
> resources unusable.
> To reproduce this problem:
> (1) use DominantResourceCalculator
> (2) the cluster resource has an empty resource type, for example: gpu=0
> (3) the scheduler allocates a container for app1, which has reserved containers
> and whose queue limit or user limit has been reached (used + required > limit).
> Reference code in RegularContainerAllocator#assignContainer:
> {code:java}
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> The value of resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu>; the result
> of {{Resources#greaterThan}} will then be false when using
> DominantResourceCalculator.






[jira] [Updated] (YARN-8771) CapacityScheduler fails to unreserve when cluster resource contains empty resource type

2018-09-13 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-8771:
---
Attachment: YARN-8771.001.patch
