[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197409#comment-14197409 ]
Vinod Kumar Vavilapalli commented on YARN-2579:
-----------------------------------------------

Just got a summary of this from [~jianhe].

I think the fundamental problem is that the main event dispatcher handles events (RMFatalEventType) that can take a lock on the ResourceManager.

I propose the following:
# The main event dispatcher should be limited to handling events coming from active services. That way none of those events lock the ResourceManager itself.
# The State Store and the Embedded elector DO NOT use the dispatcher to transition the RM (this is because the Dispatcher itself is an active service).
## The Embedded elector can always transition RM state synchronously.
## The State Store can spawn a separate thread to transition RM state. We could take a shortcut by transitioning RM state inside the StateStore's dispatcher itself, but eventually that event will try to close the StateStore, so we should avoid this.
# The StateStore sending out a fatal event and then proceeding to do more state-store writes doesn't make sense. Once the StateStore sees a fatal event, it should go into an RMStateStoreState.SHUTDOWN state and stop processing any more events.

We can do (3) in a separate patch to reduce scope.

> Both RM's state is Active , but 1 RM is not really active.
> ----------------------------------------------------------
>
>                 Key: YARN-2579
>                 URL: https://issues.apache.org/jira/browse/YARN-2579
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and
> each displayed its state as Active. But one of the RMs' ActiveServices were
> stopped.
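
To make points 2.2 and 3 of the proposal in the comment above more concrete, here is a rough, self-contained Java sketch. It does not use the real YARN classes: `FakeResourceManager`, `FakeStateStore`, `StoreState` and `EventType` are hypothetical stand-ins invented for illustration, and the locking is simplified to plain `synchronized` methods. It only shows the shape of the idea: on a fatal event the store flips to a shutdown state, drops any later events, and hands the RM transition to a separate thread, so the event handler itself never waits on the RM lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch only: every type here is a hypothetical stand-in, not a YARN class.
class StateStoreShutdownSketch {

    enum StoreState { ACTIVE, SHUTDOWN }
    enum EventType { STORE_APP, FATAL_ERROR }

    /** Stand-in for the RM: transitions take the RM lock and close the store. */
    static class FakeResourceManager {
        private final CountDownLatch standby = new CountDownLatch(1);
        private boolean active = true;

        synchronized void transitionToStandby(FakeStateStore store) {
            active = false;
            store.close(); // stopping active services also closes the store
            standby.countDown();
        }

        synchronized boolean isActive() { return active; }

        // Deliberately not synchronized, so waiting never holds the RM lock.
        boolean awaitStandby(long timeoutMs) {
            try {
                return standby.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Stand-in for the state store and its event handler. */
    static class FakeStateStore {
        private final FakeResourceManager rm;
        private final List<String> writes = new ArrayList<>();
        private StoreState state = StoreState.ACTIVE;
        private boolean closed = false;

        FakeStateStore(FakeResourceManager rm) { this.rm = rm; }

        synchronized void handle(EventType type, String payload) {
            if (state == StoreState.SHUTDOWN) {
                return; // point 3: no more processing after a fatal event
            }
            if (type == EventType.FATAL_ERROR) {
                state = StoreState.SHUTDOWN;
                // Point 2.2: transition the RM on a separate thread instead of
                // doing it inside the handler that the transition will close.
                Thread t = new Thread(() -> rm.transitionToStandby(this), "rm-transition");
                t.setDaemon(true);
                t.start();
                return;
            }
            writes.add(payload);
        }

        synchronized void close() { closed = true; }
        synchronized boolean isClosed() { return closed; }
        synchronized StoreState getState() { return state; }
        synchronized List<String> getWrites() { return new ArrayList<>(writes); }
    }

    public static void main(String[] args) {
        FakeResourceManager rm = new FakeResourceManager();
        FakeStateStore store = new FakeStateStore(rm);
        store.handle(EventType.STORE_APP, "app_1");
        store.handle(EventType.FATAL_ERROR, "fenced");
        store.handle(EventType.STORE_APP, "app_2"); // dropped: store is shut down
        boolean done = rm.awaitStandby(2000);
        System.out.println("standbyReached=" + done
            + " writes=" + store.getWrites()
            + " storeState=" + store.getState()
            + " rmActive=" + rm.isActive()
            + " storeClosed=" + store.isClosed());
    }
}
```

In this sketch the handler holds only the store's lock and returns immediately after starting the transition thread, while the transition thread takes the RM lock first and the store lock second. Neither side waits on the other while holding a lock the other needs, which is the property the proposal is after.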