[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988529#comment-13988529 ]
Tsuyoshi OZAWA commented on YARN-1861:
--------------------------------------

[~xgong] Great work. Xuan's test case verifies Karthik's fix by injecting RMFatalEventType.STATE_STORE_FENCED directly. My review comments are as follows:

{code}
   // Transition to standby and reinit active services
   LOG.info("Transitioning RM to Standby mode");
   rm.transitionToStandby(true);
+  rm.adminService.resetLeaderElection();
   return;
 } catch (Exception e) {
{code}

We should call rm.adminService.resetLeaderElection() in a finally block. If rm.transitionToStandby() fails while stopping the RM's services, resetLeaderElection() is never reached and all RMs can get stuck in standby.

{code}
+ int maxWaittingAttempt = 20;
+ while (maxWaittingAttempt -- > 0) {
{code}

maxWaittingAttempt should be maxWaitingAttempt.

> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-1861
>                 URL: https://issues.apache.org/jira/browse/YARN-1861
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Xuan Gong
>            Priority: Blocker
>         Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch,
> YARN-1861.5.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got
> into standby state and no one became active.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
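
The finally-block suggestion in the comment above can be illustrated with a minimal, self-contained sketch. It does not use the real ResourceManager or AdminService classes: FakeRm, handleFenced, and the events list are hypothetical stand-ins invented for illustration, and only the method names transitionToStandby and resetLeaderElection are taken from the quoted patch. The sketch only shows that a call placed in finally still runs when the transition throws.

```java
import java.util.ArrayList;
import java.util.List;

public class FinallyResetSketch {

    /** Hypothetical stand-in for the RM; NOT the real YARN ResourceManager API. */
    static class FakeRm {
        final List<String> events = new ArrayList<>();
        final boolean failOnTransition;

        FakeRm(boolean failOnTransition) {
            this.failOnTransition = failOnTransition;
        }

        // Simulates a transition that may fail while stopping services.
        void transitionToStandby(boolean initialize) throws Exception {
            events.add("transitionToStandby");
            if (failOnTransition) {
                throw new Exception("failed while stopping services");
            }
        }

        // Simulates rejoining leader election.
        void resetLeaderElection() {
            events.add("resetLeaderElection");
        }
    }

    /** Handles a simulated fenced event; the reset lives in finally. */
    static void handleFenced(FakeRm rm) {
        try {
            rm.transitionToStandby(true);
        } catch (Exception e) {
            // Assumption: the real handler would log and escalate here.
            rm.events.add("error: " + e.getMessage());
        } finally {
            // Runs whether or not the transition succeeded.
            rm.resetLeaderElection();
        }
    }

    public static void main(String[] args) {
        FakeRm healthy = new FakeRm(false);
        handleFenced(healthy);
        System.out.println(healthy.events);

        FakeRm failing = new FakeRm(true);
        handleFenced(failing);
        System.out.println(failing.events);
    }
}
```

With the reset placed after transitionToStandby() inside the try block, as in the quoted patch, the failing case would skip the reset entirely, which is the stuck-in-standby scenario the comment describes.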