[ 
https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173395#comment-15173395
 ] 

sandflee commented on YARN-4741:
--------------------------------

Without the fixes from YARN-3990 and YARN-3896, our RM was flooded by node up/down 
events and the node was resynced. We see the same output in the NM log:
{quote}
2016-02-18 01:39:43,217 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node is out of 
sync with ResourceManager, hence resyncing.
2016-02-18 01:39:43,217 WARN 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from 
ResourceManager: Too far behind rm response id:100314 nm response id:0
{quote}
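
To make the "Too far behind" message concrete, here is a toy model of the response id comparison as I understand it. This is a sketch, not the Hadoop code: the class, enum and method names are made up, and the exact comparison is my assumption about how the RM treats the id carried by a heartbeat.

```java
// Toy model only. Names and the exact comparison are assumptions,
// not copied from ResourceTrackerService.
public class ResponseIdCheck {
    enum Action { NORMAL, RESEND_LAST_RESPONSE, RESYNC }

    // nmResponseId: id the NM reports in its heartbeat.
    // rmResponseId: id of the last response the RM remembers for that node.
    static Action check(int nmResponseId, int rmResponseId) {
        if (nmResponseId + 1 == rmResponseId) {
            // NM missed exactly one response: treat as a duplicate heartbeat.
            return Action.RESEND_LAST_RESPONSE;
        }
        if (nmResponseId + 1 < rmResponseId) {
            // NM is far behind the RM: ask it to resync.
            return Action.RESYNC;
        }
        return Action.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(check(100314, 100314)); // NORMAL
        System.out.println(check(100313, 100314)); // RESEND_LAST_RESPONSE
        System.out.println(check(0, 100314));      // RESYNC, the case in the log above
    }
}
```

With the ids from the log (rm 100314, nm 0), this model lands in the resync branch.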

The sequence may be something like this:
1,  the NM restarts, and ResourceTrackerService sends a NodeReconnectEvent to reset 
the response id to 0,
2,  a nodeHeartbeat is processed before the NodeReconnectEvent is handled (the dispatcher 
is flooded by RMAppNodeUpdateEvent), so the RM sends a resync command to the NM because 
the response ids do not match,
3,  the rmNode moves to REBOOT status and is removed from rmContext.activeNodes,
4,  the NM registers; a new rmNode is created, added to rmContext.activeNodes, and a 
NodeStartEvent is sent,
5,  the scheduler completes the containers running on the node; for an AM container, a 
FINISHED_CONTAINERS_PULLED_BY_AM event is sent to the RMNode, but the RMNode is in the 
NEW state and cannot handle FINISHED_CONTAINERS_PULLED_BY_AM.
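
The last step can be illustrated with a toy state machine. This is only a sketch under assumptions: ToyNode, countInvalid and the two-state model are hypothetical and much simpler than RMNodeImpl; the point is just that a node object still in NEW rejects every FINISHED_CONTAINERS_PULLED_BY_AM event delivered before its start event.

```java
// Toy model only. ToyNode and countInvalid are hypothetical names;
// the real RMNodeImpl has more states and transitions.
public class NewStateRace {
    enum State { NEW, RUNNING }
    enum Event { STARTED, FINISHED_CONTAINERS_PULLED_BY_AM }

    static final class ToyNode {
        State state = State.NEW;
        int invalidEvents = 0;

        void handle(Event event) {
            if (state == State.NEW && event == Event.STARTED) {
                state = State.RUNNING;
            } else if (state == State.RUNNING
                    && event == Event.FINISHED_CONTAINERS_PULLED_BY_AM) {
                // Valid in RUNNING: nothing to model here.
            } else {
                // No transition for this (state, event) pair.
                invalidEvents++;
            }
        }
    }

    // Delivers the events in the given order to a freshly created node
    // and returns how many were rejected.
    static int countInvalid(Event... events) {
        ToyNode node = new ToyNode();
        for (Event event : events) {
            node.handle(event);
        }
        return node.invalidEvents;
    }

    public static void main(String[] args) {
        // Events reach the new node before its start event.
        System.out.println(countInvalid(
                Event.FINISHED_CONTAINERS_PULLED_BY_AM,
                Event.FINISHED_CONTAINERS_PULLED_BY_AM,
                Event.STARTED)); // 2
        // Start event is handled first.
        System.out.println(countInvalid(
                Event.STARTED,
                Event.FINISHED_CONTAINERS_PULLED_BY_AM,
                Event.FINISHED_CONTAINERS_PULLED_BY_AM)); // 0
    }
}
```

In this model, each rejected event corresponds to one "Invalid event FINISHED_CONTAINERS_PULLED_BY_AM" line in the RM log.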

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async 
> dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded 
> with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event 
> queue.
> In our setup, we had the RM HA or stateful restart *disabled*, but NM 
> work-preserving restart *enabled*. Due to other issues, we did a cluster-wide 
> NM restart.
> Some time during the restart (which took multiple hours), we started seeing 
> the async dispatcher event queue building. Normally it would log 1,000. In 
> this case, it climbed all the way up to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> 2016-02-18 01:47:29,535 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle 
> this event at current state
> 2016-02-18 01:47:29,538 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid 
> event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> {noformat}
> And that node in question was restarted a few minutes earlier.
> When we inspected the RM heap, it was full of 
> RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another 
> cluster-wide rolling restart. Initially that seemed to have helped reduce the 
> queue size, but the queue built back up to several millions and continued for 
> an extended period. We had to restart the RM to resolve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
