[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988529#comment-13988529
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--------------------------------------

[~xgong] Great work. Xuan's test case verifies Karthik's fix by injecting 
RMFatalEventType.STATE_STORE_FENCED directly.

My review comments are as follows:
{code}
             // Transition to standby and reinit active services
             LOG.info("Transitioning RM to Standby mode");
             rm.transitionToStandby(true);
+            rm.adminService.resetLeaderElection();
             return;
           } catch (Exception e) {
{code}

We should call rm.adminService.resetLeaderElection() in a finally block. If 
rm.transitionToStandby() fails while stopping the RM's services, 
resetLeaderElection() is skipped and all RMs can get stuck in standby.
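
A minimal sketch of the idea, not the actual patch: FakeRm and handleFenced 
are hypothetical stand-ins for the ResourceManager and the fenced-event 
handler, and the real error handling in the catch block may differ. It only 
shows that the election reset runs whether or not the transition throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the real YARN classes.
class StandbyTransitionSketch {

    /** Records calls so the order of operations can be inspected. */
    static final class FakeRm {
        final List<String> calls = new ArrayList<>();
        final boolean failOnStandby;

        FakeRm(boolean failOnStandby) {
            this.failOnStandby = failOnStandby;
        }

        // Simulates rm.transitionToStandby(true), optionally failing
        // as if stopping the RM's services had thrown.
        void transitionToStandby(boolean reinitActiveServices) throws Exception {
            calls.add("transitionToStandby");
            if (failOnStandby) {
                throw new Exception("failed while stopping services");
            }
        }

        // Simulates rm.adminService.resetLeaderElection().
        void resetLeaderElection() {
            calls.add("resetLeaderElection");
        }
    }

    /** Returns true if the transition succeeded; always rejoins the election. */
    static boolean handleFenced(FakeRm rm) {
        try {
            rm.transitionToStandby(true);
            return true;
        } catch (Exception e) {
            // Assumption: the real handler would log and escalate here.
            return false;
        } finally {
            // Runs on both the success and the failure path.
            rm.resetLeaderElection();
        }
    }
}
```

With the call inside the try block, as in the current patch, the failure 
path would never reach resetLeaderElection().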

{code}
+    int maxWaittingAttempt = 20;
+    while (maxWaittingAttempt -- > 0) {
{code}

maxWaittingAttempt should be maxWaitingAttempt.

> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-1861
>                 URL: https://issues.apache.org/jira/browse/YARN-1861
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Xuan Gong
>            Priority: Blocker
>         Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RMs went 
> into standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
