Benjamin Teke created YARN-10295:
------------------------------------
Summary: CapacityScheduler NPE can cause apps to get stuck without
resources
Key: YARN-10295
URL: https://issues.apache.org/jira/browse/YARN-10295
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 3.2.0, 3.1.0
Reporter: Benjamin Teke
Assignee: Benjamin Teke
When CapacityScheduler asynchronous scheduling is enabled, there is an edge
case where a NullPointerException can cause the scheduler thread to exit,
leaving apps stuck without allocated resources. Consider the following log:
{code:java}
2020-05-27 10:13:49,106 INFO fica.FiCaSchedulerApp
(FiCaSchedulerApp.java:apply(681)) - Reserved
container=container_e10_1590502305306_0660_01_000115, on node=host:
ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14
available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with
resource=<memory:4096, vCores:1>
2020-05-27 10:13:49,134 INFO fica.FiCaSchedulerApp
(FiCaSchedulerApp.java:internalUnreserve(743)) - Application
application_1590502305306_0660 unreserved on node host:
ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14
available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently
has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
2020-05-27 10:13:49,134 INFO capacity.CapacityScheduler
(CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler
(YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread
Thread[Thread-4953,5,main] threw an Exception.
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
{code}
A container gets reserved on a host that doesn't have enough memory, so shortly
afterwards it gets unreserved. However, because the scheduler thread runs
asynchronously, it may already have entered the following if block, located in
[CapacityScheduler.java#L1602|https://github.com/apache/hadoop/blob/7136ebbb7aa197717619c23a841d28f1c46ad40b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1602],
because at that time _node.getReservedContainer()_ wasn't null. The second
call, made to get the ApplicationAttemptId, then throws an NPE because the
container was unreserved in the meantime.
{code:java}
// Do not schedule if there are any reservations to fulfill on the node
if (node.getReservedContainer() != null) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping scheduling since node " + node.getNodeID()
        + " is reserved by application " + node.getReservedContainer()
        .getContainerId().getApplicationAttemptId());
  }
  return null;
}
{code}
A fix would be to store the container object in a local variable before the if.
As a precaution, the
org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl#getId/setId methods
should also be declared synchronized, as they'll be accessed from multiple
threads.
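The first part of the proposed fix could look roughly like the sketch below. It is not the actual patch: ReservedContainerSnapshotSketch, Node, ReservedContainer and skipReason are hypothetical stand-ins for the real YARN classes, and the real code logs the message instead of returning it. The point it illustrates is that the reservation is read only once, so a concurrent unreserve cannot turn the second dereference into an NPE.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-ins; these are NOT the real YARN classes.
final class ReservedContainerSnapshotSketch {

  /** Stand-in for a reserved container that knows its application attempt. */
  static final class ReservedContainer {
    private final String applicationAttemptId;

    ReservedContainer(String applicationAttemptId) {
      this.applicationAttemptId = applicationAttemptId;
    }

    String getApplicationAttemptId() {
      return applicationAttemptId;
    }
  }

  /** Stand-in for a scheduler node whose reservation another thread may clear. */
  static final class Node {
    private final String nodeId;
    private final AtomicReference<ReservedContainer> reserved =
        new AtomicReference<>();

    Node(String nodeId) {
      this.nodeId = nodeId;
    }

    String getNodeId() {
      return nodeId;
    }

    ReservedContainer getReservedContainer() {
      return reserved.get();
    }

    void reserve(ReservedContainer container) {
      reserved.set(container);
    }

    void unreserve() {
      reserved.set(null);
    }
  }

  /**
   * Returns the "skip" message if the node is reserved, or null otherwise.
   * The reservation is read exactly once into a local variable, so the
   * null check and the later dereference see the same object.
   */
  static String skipReason(Node node) {
    ReservedContainer reservedContainer = node.getReservedContainer();
    if (reservedContainer != null) {
      return "Skipping scheduling since node " + node.getNodeId()
          + " is reserved by application "
          + reservedContainer.getApplicationAttemptId();
    }
    return null;
  }

  public static void main(String[] args) {
    Node node = new Node("example-host:25454");
    node.reserve(new ReservedContainer("example_attempt_000001"));
    System.out.println(skipReason(node));
    node.unreserve();
    System.out.println(skipReason(node));
  }
}
```

Even if another thread calls unreserve() right after the local variable is assigned, the method still holds a non-null reference and only reports a reservation that has since gone stale, instead of crashing the scheduler thread.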