[
https://issues.apache.org/jira/browse/YARN-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029396#comment-17029396
]
Mikayla Konst commented on YARN-9011:
-
We experienced this exact same race condition recently (resource manager
sending SHUTDOWN signal to node manager because it received a heartbeat from
the node manager *after* the HostDetails reference was updated, but *before*
the node was transitioned to state DECOMMISSIONING).
I think this patch is a huge improvement over the previous behavior, but there
is still a narrow race that can happen when refresh nodes is called multiple
times in quick succession with the same set of nodes in the exclude file:
# lazy-loaded HostDetails reference is updated
# nodes are added to gracefullyDecommissionableNodes set
# current HostDetails reference is updated
# event to update node status to DECOMMISSIONING is added to asynchronous
event handler's event queue, but hasn't been processed yet
# refresh nodes is called a second time
# lazy-loaded HostDetails reference is updated
# gracefullyDecommissionableNodes set is cleared
# node manager heartbeats to resource manager. It is not in state
DECOMMISSIONING and not in the gracefullyDecommissionableNodes set, but is an
excluded node in the HostDetails, so it is sent a SHUTDOWN signal
# node is added to gracefullyDecommissionableNodes set
# event handler transitions node to state DECOMMISSIONING at some point
This would be fixed if you used an AtomicReference for your set of
"gracefullyDecommissionableNodes" and swapped out the reference, similar to how
you handled the HostDetails.
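To illustrate the AtomicReference idea, here is a minimal sketch (class and field names are illustrative, not the actual NodesListManager members): refresh builds the new set completely and then publishes it in a single reference swap, so a concurrent heartbeat always observes either the old complete set or the new complete set, never an emptied or half-populated one.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the real YARN code: publish the
// gracefully-decommissionable set atomically instead of clearing
// and repopulating a shared set in place.
class DecommissionableNodes {
    private final AtomicReference<Set<String>> gracefullyDecommissionableNodes =
        new AtomicReference<>(Collections.emptySet());

    // Build the replacement set fully, then swap the reference in one step.
    // A heartbeat thread calling isGracefullyDecommissionable() concurrently
    // never sees the transient "cleared but not yet refilled" state.
    void refreshNodes(Set<String> excludedNodes) {
        Set<String> fresh = new HashSet<>(excludedNodes);
        gracefullyDecommissionableNodes.set(Collections.unmodifiableSet(fresh));
    }

    boolean isGracefullyDecommissionable(String node) {
        return gracefullyDecommissionableNodes.get().contains(node);
    }
}
```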
Alternatively, instead of using an asynchronous event handler to update the
state of the nodes to DECOMMISSIONING, you could update the state
synchronously: grab a lock, update HostDetails, synchronously update the states
of the nodes being gracefully decommissioned, then release the lock. When the
resource tracker service receives a heartbeat and needs to check whether a node
should be shut down (i.e., whether it is excluded and in state
DECOMMISSIONING), it would grab the lock right before doing the check. Having
the resource tracker service wait on a lock doesn't sound great, but the wait
would likely be on the order of milliseconds, and only when refresh nodes is
called.
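The synchronous alternative can be sketched like this (again with illustrative names, not the real ResourceTrackerService/NodesListManager classes): refresh performs the exclude-list update and the state transition under one lock, and the heartbeat path takes the same lock before deciding to send SHUTDOWN, so it can never observe a node that is excluded but not yet decommissioning.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the synchronous approach described above.
class NodesListManagerSketch {
    private final ReentrantLock refreshLock = new ReentrantLock();
    private Set<String> excludedHosts = new HashSet<>();
    private final Set<String> decommissioningNodes = new HashSet<>();

    void refreshNodesGracefully(Set<String> newExcludes) {
        refreshLock.lock();
        try {
            excludedHosts = new HashSet<>(newExcludes);
            // The state transition happens here, synchronously, instead of
            // via an async event: by the time the lock is released, every
            // excluded node is already marked DECOMMISSIONING.
            decommissioningNodes.addAll(newExcludes);
        } finally {
            refreshLock.unlock();
        }
    }

    // Heartbeat path: send SHUTDOWN only if the node is excluded and NOT
    // decommissioning. Holding the same lock means this check can never run
    // inside the window between the exclude-list update and the transition.
    boolean shouldShutdown(String node) {
        refreshLock.lock();
        try {
            return excludedHosts.contains(node)
                && !decommissioningNodes.contains(node);
        } finally {
            refreshLock.unlock();
        }
    }
}
```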
> Race condition during decommissioning
> -
>
> Key: YARN-9011
> URL: https://issues.apache.org/jira/browse/YARN-9011
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.1.1
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9011-001.patch, YARN-9011-002.patch,
> YARN-9011-003.patch, YARN-9011-004.patch, YARN-9011-005.patch,
> YARN-9011-006.patch, YARN-9011-007.patch, YARN-9011-008.patch,
> YARN-9011-009.patch, YARN-9011-branch-3.1.001.patch,
> YARN-9011-branch-3.2.001.patch
>
>
> During internal testing, we found a nasty race condition which occurs during
> decommissioning.
> Node manager, incorrect behaviour:
> {noformat}
> 2018-06-18 21:00:17,634 WARN
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting
> down.
> 2018-06-18 21:00:17,634 WARN
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from
> ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041
> hostname:node-6.hostname.com
> {noformat}
> Node manager, expected behaviour:
> {noformat}
> 2018-06-18 21:07:37,377 WARN
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received
> SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting
> down.
> 2018-06-18 21:07:37,377 WARN
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from
> ResourceManager: DECOMMISSIONING node-6.hostname.com:8041 is ready to be
> decommissioned
> {noformat}
> Note the two different messages from the RM ("Disallowed NodeManager" vs
> "DECOMMISSIONING"). The problem is that {{ResourceTrackerService}} can see an
> inconsistent state of nodes while they're being updated:
> {noformat}
> 2018-06-18 21:00:17,575 INFO
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader
> include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219}
> exclude:{node-6.hostname.com}
> 2018-06-18 21:00:17,575 INFO
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully
> decommission node node-6.