[ https://issues.apache.org/jira/browse/YARN-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083549#comment-16083549 ]
Jian He commented on YARN-6102: ------------------------------- the rmContext related looks a bit confusing. could you add comments to explain what each context means ? And maybe call it alwaysOnService instead of globalService ? and try to group the setter/getter for each context separately in the code layout and may draw a line in between. > On failover RM can crash due to unregistered event to AsyncDispatcher > --------------------------------------------------------------------- > > Key: YARN-6102 > URL: https://issues.apache.org/jira/browse/YARN-6102 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.8.0, 2.7.3 > Reporter: Ajith S > Assignee: Rohith Sharma K S > Priority: Critical > Attachments: eventOrder.JPG, YARN-6102.01.patch, YARN-6102.02.patch, > YARN-6102.03.patch > > > {code}2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] > event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in > dispatcher thread > java.lang.Exception: No handler for registered for class > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120) > at java.lang.Thread.run(Thread.java:745) > 2017-01-17 16:42:17,914 INFO [AsyncDispatcher ShutDown handler] > event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..{code} > The same stack i was also noticed in {{TestResourceTrackerOnHA}} exits > abnormally, after some analysis, i was able to reproduce. > Once the nodeHeartBeat is sent to RM, inside > {{org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest)}}, > before sending it to dispatcher through > {{this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);}} > if RM failover is called, the dispatcher is reset > The new dispatcher is however first started and then the events are > registered at > {{org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean)}} > So event order will look like > 1. Send Node heartbeat to {{ResourceTrackerService}} > 2. In {{ResourceTrackerService.nodeHeartbeat}}, before passing to dispatcher > call RM failover > 3. In RM Failover, current active will reset dispatcher @reinitialize i.e ( > {{resetDispatcher();}} + {{createAndInitActiveServices();}} ) > Now between {{resetDispatcher();}} and {{createAndInitActiveServices();}} , > the {{ResourceTrackerService.nodeHeartbeat}} invokes dipatcher > This will cause the above error as at point of time when {{STATUS_UPDATE}} > event is given to dispatcher in {{ResourceTrackerService}} , the new > dispatcher(from the failover) may be started but not yet registered for events > Using same steps(with pausing JVM at debug), i was able to reproduce this in > production cluster also. for {{STATUS_UPDATE}} active service event, when the > service is yet to forward the event to RM dispatcher but a failover is called > and dispatcher reset is between {{resetDispatcher();}} & > {{createAndInitActiveServices();}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org