[jira] [Updated] (YARN-9430) Recovering containers does not check available resources on node

2019-03-31 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9430:
-
Description: 
I have a testcase that checks that if some GPU devices go offline and recovery 
happens, only the containers that still fit into the node's resources are 
recovered. Unfortunately, this is not the case: the RM does not check the 
available resources on the node during recovery.
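
To illustrate the kind of guard I would have expected in the recovery path, here is a minimal, self-contained sketch. This is _not_ the actual YARN code: the class and method names (and the resource numbers) below are made up for this example, loosely mirroring a fitsIn-style comparison between a recovered container's capability and the node's still-unallocated resources.
{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model only - NOT the real YARN classes. It sketches the check that seems
 * to be missing: a recovered container should only be re-attached to a node if
 * its capability still fits into the node's unallocated resources.
 */
public class RecoveryCheckSketch {

  /** Minimal stand-in for a YARN Resource (memory, vCores, yarn.io/gpu, ...). */
  static class Res {
    final Map<String, Long> values = new HashMap<>();
    Res with(String name, long value) { values.put(name, value); return this; }
    long get(String name) { return values.getOrDefault(name, 0L); }
  }

  /** Analogue of a fitsIn(smaller, bigger) check: every dimension must fit. */
  static boolean fitsIn(Res smaller, Res bigger) {
    for (Map.Entry<String, Long> e : smaller.values.entrySet()) {
      if (e.getValue() > bigger.get(e.getKey())) {
        return false;
      }
    }
    return true;
  }

  /** The guard that recovery apparently skips today. */
  static boolean shouldRecoverOnNode(Res containerCapability, Res nodeUnallocated) {
    return fitsIn(containerCapability, nodeUnallocated);
  }

  public static void main(String[] args) {
    // Node after restart: only 1 GPU device is reported (the other went offline).
    // Memory / vCores figures are arbitrary, just for the example.
    Res nodeUnallocated = new Res().with("memory-mb", 8192).with("vcores", 8).with("yarn.io/gpu", 1);
    // A recovered container that was started with 1 GPU before the restart.
    Res gpuContainer = new Res().with("memory-mb", 1024).with("vcores", 1).with("yarn.io/gpu", 1);

    System.out.println(shouldRecoverOnNode(gpuContainer, nodeUnallocated)); // true - the first container still fits
    nodeUnallocated.with("yarn.io/gpu", 0); // pretend the first container got re-attached and took the GPU
    System.out.println(shouldRecoverOnNode(gpuContainer, nodeUnallocated)); // false - the second one should not be recovered here
  }
}
{code}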

*Detailed explanation:*

*Testcase:* 
 1. There are 2 nodes running NodeManagers.
 2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices 
per node, i.e. 4 GPU devices in the cluster altogether.
 3. RM / NM recovery is enabled.
 4. The test starts a sleep job requesting 4 containers, with 1 GPU device for 
each (the AM does not request GPUs).
 5. Before restart, the fake bash script is adjusted so that it reports only 1 GPU device per 
node (2 in the cluster) after restart.
 6. Restart is initiated.

 

*Expected behavior:* 
 After restart, only the AM and 2 normal containers should have been started, 
as there are only 2 GPU devices in the cluster.

 

*Actual behavior:* 
 The AM + 4 containers are allocated, i.e. all of the containers originally started 
in step 4.
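
Just to spell out the expected numbers, here is a throwaway sketch (plain Java, nothing YARN-specific) of the capacity math for this scenario; the values come straight from the testcase steps above.
{code:java}
/** Back-of-the-envelope math for the scenario above. */
public class ExpectedRecoveryCount {
  public static void main(String[] args) {
    int nodes = 2;
    int gpusPerNodeBeforeRestart = 2; // fake nvidia-smi initially reports 2 devices per node
    int gpusPerNodeAfterRestart = 1;  // adjusted to report 1 device per node after restart
    int gpuContainersStarted = 4;     // sleep job: 4 containers, 1 GPU each (AM needs no GPU)

    int clusterGpusBefore = nodes * gpusPerNodeBeforeRestart; // = 4, so all 4 containers fit initially
    int clusterGpusAfter = nodes * gpusPerNodeAfterRestart;   // = 2

    int recoverableGpuContainers = Math.min(gpuContainersStarted, clusterGpusAfter); // = 2

    System.out.println("Cluster GPUs before/after restart: " + clusterGpusBefore + "/" + clusterGpusAfter);
    System.out.println("Expected after restart: AM + " + recoverableGpuContainers + " containers");
    System.out.println("Observed after restart: AM + " + gpuContainersStarted + " containers");
  }
}
{code}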

App id was: 1553977186701_0001

*Logs*:

 
{code:java}
2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_01 of type RECOVER
2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_01 to scheduler from user: systest
2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_01 is recovering. Skipping notifying ATTEMPT_ADDED
2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on event = RECOVER
2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_01, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_04, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_04 of capacity  on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers,  vCores:2, yarn.io/gpu: 1> used and  available after allocation
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_05, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_05 of type RECOVER
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_05 of capacity  on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers,  vCores:3, yarn.io/gpu: 2> used and  available after allocation
2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_03, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_03 of type RECOVER
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_03 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned
{code}

[jira] [Updated] (YARN-9430) Recovering containers does not check available resources on node

2019-03-31 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9430:
-
Priority: Critical  (was: Major)

> Recovering containers does not check available resources on node
> 
>
> Key: YARN-9430
> URL: https://issues.apache.org/jira/browse/YARN-9430
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Critical
>