[jira] [Updated] (YARN-9430) Recovering containers does not check available resources on node
[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth updated YARN-9430:
    Description:

I have a testcase that checks that when some GPU devices go offline and recovery happens, only the containers that fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.

*Detailed explanation:*

*Testcase:*
1. There are 2 nodes running NodeManagers.
2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node, i.e. 4 GPU devices in the cluster altogether.
3. RM / NM recovery is enabled (a configuration sketch follows the logs below).
4. The test starts a sleep job requesting 4 containers, with 1 GPU device for each (the AM does not request GPUs).
5. Before the restart, the fake bash script is adjusted to report 1 GPU device per node (2 in the cluster) after restart.
6. Restart is initiated.

*Expected behavior:*
After restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices in the cluster.

*Actual behavior:*
The AM + 4 containers are allocated, i.e. all of the containers originally started in step 4, even though only 2 GPU devices are available (a sketch of the missing check follows the logs below).

App id was: 1553977186701_0001

*Logs*:
{code:java}
2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_01 of type RECOVER
2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_01 to scheduler from user: systest
2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_01 is recovering. Skipping notifying ATTEMPT_ADDED
2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_01 State change from NEW to LAUNCHED on event = RECOVER
2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_01, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_04, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_04 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, used and available after allocation
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_05, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_05 of type RECOVER
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_05 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_05 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, used and available after allocation
2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_03, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_03 of type RECOVER
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_03 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned
{code}
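For completeness, this is roughly how RM / NM recovery (step 3) was enabled. A minimal sketch using the standard YarnConfiguration recovery keys; the choice of FileSystemRMStateStore as the RM state store is just one possible option, shown here for illustration:
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RecoveryConfigSketch {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // RM work-preserving recovery (yarn.resourcemanager.recovery.enabled)
    conf.setBoolean(YarnConfiguration.RECOVERY_ENABLED, true);
    // NM recovery (yarn.nodemanager.recovery.enabled)
    conf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true);
    // Assumption: a file-system based RM state store; any supported store would do
    conf.set(YarnConfiguration.RM_STORE,
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore");
  }
}
{code}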
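Based on the logs above, FSSchedulerNode keeps assigning recovered containers regardless of what the node can still offer after restart. Below is a minimal, self-contained sketch of the kind of check I would expect during recovery; ResourceModel and NodeModel are simplified stand-ins for illustration only, not the actual YARN Resource / SchedulerNode classes:
{code:java}
/** Simplified stand-in for a container/node resource; not YARN's Resource class. */
class ResourceModel {
  final long memoryMb; final int vcores; final int gpus;
  ResourceModel(long memoryMb, int vcores, int gpus) {
    this.memoryMb = memoryMb; this.vcores = vcores; this.gpus = gpus;
  }
  boolean fitsIn(ResourceModel available) {
    return memoryMb <= available.memoryMb && vcores <= available.vcores && gpus <= available.gpus;
  }
  ResourceModel subtract(ResourceModel other) {
    return new ResourceModel(memoryMb - other.memoryMb, vcores - other.vcores, gpus - other.gpus);
  }
}

/** Simplified stand-in for a scheduler node; not YARN's SchedulerNode class. */
class NodeModel {
  private ResourceModel available;
  NodeModel(ResourceModel total) { this.available = total; }

  /** Recover a container only if its capability still fits on the node. */
  boolean recoverContainer(ResourceModel capability) {
    if (!capability.fitsIn(available)) {
      // Expected behavior: reject (or kill) the container instead of
      // silently over-allocating the node, as happens today.
      return false;
    }
    available = available.subtract(capability);
    return true;
  }
}

public class RecoveryCheckSketch {
  public static void main(String[] args) {
    // The node reports only 1 GPU after restart (step 5 of the testcase).
    NodeModel node = new NodeModel(new ResourceModel(8192, 8, 1));
    ResourceModel first = new ResourceModel(1024, 1, 1);  // fits: uses the single GPU
    ResourceModel second = new ResourceModel(1024, 1, 1); // no GPU left, must not be recovered
    System.out.println("first recovered=" + node.recoverContainer(first));
    System.out.println("second recovered=" + node.recoverContainer(second));
  }
}
{code}
With the node reporting a single GPU after restart, only the first of the two stored GPU containers is recovered; the second is rejected instead of over-allocating the node, which is what the testcase expects.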
[jira] [Updated] (YARN-9430) Recovering containers does not check available resources on node
[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth updated YARN-9430:
    Priority: Critical  (was: Major)

> Recovering containers does not check available resources on node
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Critical