[ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Szilard Nemeth updated YARN-9430:
---------------------------------
    Priority: Critical  (was: Major)

Recovering containers does not check available resources on node
-----------------------------------------------------------------

                 Key: YARN-9430
                 URL: https://issues.apache.org/jira/browse/YARN-9430
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Szilard Nemeth
            Assignee: Szilard Nemeth
            Priority: Critical

I have a testcase that checks that if some GPU devices go offline and recovery happens, only the containers that fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.

*Detailed explanation:*

*Testcase:*
1. There are 2 nodes running NodeManagers.
2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node. This means 4 GPU devices in the cluster altogether.
3. RM / NM recovery is enabled.
4. The test starts a sleep job, requesting 4 containers with 1 GPU device each (the AM does not request GPUs).
5. Before restart, the fake bash script is adjusted to report 1 GPU device per node (2 in the cluster) after restart.
6. Restart is initiated.

*Expected behavior:*
After restart, only the AM and 2 normal containers should be started, as there are only 2 GPU devices in the cluster.

*Actual behavior:*
The AM + 4 containers are allocated, i.e. all of the containers originally started in step 4.

App id was: 1553977186701_0001

*Logs:*
2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER
2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after allocation
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation
2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1> available after allocation
2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003

There are multiple log lines like this:
{code:java}
Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation{code}
*Note the -1 value for the yarn.io/gpu resource!*

The issue lies in this method: https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179
The problem is that deductUnallocatedResource does not check whether, after the container's resource is subtracted from the unallocated resource, the unallocated resource remains at or above zero.

Here is the ResourceManager call hierarchy for the method (from top to bottom):
{code:java}
1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
   deductUnallocatedResource is called here!{code}
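To make the failure mode concrete, here is a minimal, self-contained sketch of what the unguarded deduction boils down to. This is not the actual SchedulerNode code (the class name is made up for illustration); it only uses the public Resource / Resources APIs:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustration only -- not the SchedulerNode implementation.
public class UnguardedDeductionDemo {
  public static void main(String[] args) {
    // What is still unallocated on the node at recovery time.
    Resource unallocated = Resource.newInstance(1024, 1);
    // The container being recovered needs more than that.
    Resource container = Resource.newInstance(2048, 2);

    // deductUnallocatedResource effectively performs an unchecked
    // in-place subtraction like this, so the result can go negative:
    Resources.subtractFrom(unallocated, container);

    System.out.println(unallocated); // prints <memory:-1024, vCores:-1>
  }
}{code}
This mirrors the logs above: recovering a 1-GPU container on a node whose only GPU is already used leaves yarn.io/gpu: -1 available.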
*Testcase that reproduces the issue:*
*Add this testcase to TestFSSchedulerNode:*
{code:java}
@Test
public void testRecovery() {
  RMNode node = createNode();
  FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
  RMContainer container1 = createContainer(Resource.newInstance(4096, 4), null);
  RMContainer container2 = createContainer(Resource.newInstance(4096, 4), null);

  schedulerNode.allocateContainer(container1);
  schedulerNode.containerStarted(container1.getContainerId());
  schedulerNode.allocateContainer(container2);
  schedulerNode.containerStarted(container2.getContainerId());
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());

  RMContainer container3 = createContainer(Resource.newInstance(1000, 1), null);
  when(container3.getState()).thenReturn(RMContainerState.NEW);
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());

  schedulerNode.recoverContainer(container3);
  assertEquals("No resource should have been unallocated",
      Resources.none(), schedulerNode.getUnallocatedResource());
  assertEquals("All resources of node should have been allocated",
      nodeResource, schedulerNode.getAllocatedResource());
}
{code}

*Result of testcase:*
{code:java}
java.lang.AssertionError: No resource should have been unallocated
Expected :<memory:0, vCores:0>
Actual   :<memory:-1000, vCores:-1>{code}

*It's immediately clear that not only GPUs (or other custom resource types) but all resources are affected by this issue!*

*Possible fix:*
1. A condition needs to be introduced that checks whether there are enough resources on the node; we should proceed with the container's recovery only if there are (see the sketch after this list).
2. An error log should be added. At first glance this seems sufficient, so no exception is required, but this needs more thorough investigation and a manual test on a cluster!
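A minimal sketch of the condition from point 1, assuming the check is added to SchedulerNode#recoverContainer. The existing early return for COMPLETED containers is omitted for brevity, and whether a non-fitting container should be skipped silently or failed is exactly what the investigation in point 2 should decide:
{code:java}
// Sketch only -- not the final fix. Recovery proceeds only if the
// container still fits into the node's unallocated resources.
public synchronized void recoverContainer(RMContainer rmContainer) {
  Resource required = rmContainer.getAllocatedResource();
  if (!Resources.fitsIn(required, getUnallocatedResource())) {
    // Point 2 of the possible fix: log an error instead of throwing.
    LOG.error("Cannot recover container " + rmContainer.getContainerId()
        + " on node " + getNodeID() + ": it requires " + required
        + " but only " + getUnallocatedResource() + " is unallocated.");
    return;
  }
  allocateContainer(rmContainer, true);
}{code}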