[ 
https://issues.apache.org/jira/browse/YARN-6102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080787#comment-16080787
 ] 

Sunil G commented on YARN-6102:
-------------------------------

bq.How about recreating RMContext for each transition?
I think this is a better and cleaner solution here. Other option will delay the 
switchover given some events are pending in dispatcher queue.
Also this solution does not impact the new active services, give old events 
will play out with old context and which will be taken back by gc later.


> On failover RM can crash due to unregistered event to AsyncDispatcher
> ---------------------------------------------------------------------
>
>                 Key: YARN-6102
>                 URL: https://issues.apache.org/jira/browse/YARN-6102
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3
>            Reporter: Ajith S
>            Assignee: Ajith S
>            Priority: Critical
>         Attachments: eventOrder.JPG, YARN-6102.01.patch
>
>
> {code}2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in 
> dispatcher thread
> java.lang.Exception: No handler for registered for class 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-01-17 16:42:17,914 INFO  [AsyncDispatcher ShutDown handler] 
> event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..{code}
> The same stack i was also noticed in {{TestResourceTrackerOnHA}} exits 
> abnormally, after some analysis, i was able to reproduce.
> Once the nodeHeartBeat is sent to RM, inside 
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest)}},
>  before sending it to dispatcher through
> {{this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);}} 
> if RM failover is called, the dispatcher is reset
> The new dispatcher is however first started and then the events are 
> registered at 
> {{org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean)}}
> So event order will look like
> 1. Send Node heartbeat to {{ResourceTrackerService}}
> 2. In {{ResourceTrackerService.nodeHeartbeat}}, before passing to dispatcher 
> call RM failover
> 3. In RM Failover, current active will reset dispatcher @reinitialize i.e ( 
> {{resetDispatcher();}} + {{createAndInitActiveServices();}} )
> Now between {{resetDispatcher();}} and {{createAndInitActiveServices();}} , 
> the {{ResourceTrackerService.nodeHeartbeat}} invokes dipatcher
> This will cause the above error as at point of time when {{STATUS_UPDATE}} 
> event is given to dispatcher in {{ResourceTrackerService}} , the new 
> dispatcher(from the failover) may be started but not yet registered for events
> Using same steps(with pausing JVM at debug), i was able to reproduce this in 
> production cluster also. for {{STATUS_UPDATE}} active service event, when the 
> service is yet to forward the event to RM dispatcher but a failover is called 
> and dispatcher reset is between {{resetDispatcher();}} & 
> {{createAndInitActiveServices();}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to