[jira] [Created] (YARN-9151) Standby RM hangs (not retry or crash) forever due to forever lost from leader election

Yuqi Wang (JIRA) Wed, 19 Dec 2018 02:52:59 -0800

Yuqi Wang created YARN-9151:
-------------------------------

             Summary: Standby RM hangs (not retry or crash) forever due to 
forever lost from leader election
                 Key: YARN-9151
                 URL: https://issues.apache.org/jira/browse/YARN-9151
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.9.2
            Reporter: Yuqi Wang
            Assignee: Yuqi Wang
             Fix For: 3.1.1
         Attachments: yarn_rm.zip


*Issue Summary:*
 The standby RM hangs forever (it neither retries nor crashes) because it has 
permanently dropped out of the leader election.

*Issue Repro Steps:*
 # Start multiple RMs in HA mode
 # Change all hostnames that appear in the zk connect string to different 
values in DNS. (In practice, this happens when old/bad zk machines are replaced 
with new/good ones, so their DNS hostnames change.)

 

*Issue Logs:*

The affected RM is BN4SCH101222318.

The full RM log is in the attachment, yarn_rm.zip.

To make it clear, the whole sequence of events is:
{noformat}
Join Election
Win the leader (ZK Node Creation Callback)
  Start to becomeActive 
    Start RMActiveServices 
    Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
    Stop CommonNodeLabelsManager
    Stop RMActiveServices
    Create and Init RMActiveServices
  Fail to becomeActive 
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException 
  (Here the exception is swallowed and only a transition-to-standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Already in standby state
  ReJoin Election
  Failed to Join Election due to zk connect UnknownHostException
  (Here the exception is swallowed and only a transition-to-standby event is sent)
  Send RMFatalEvent to transition RM to standby
Transitioning RM to Standby
  Start StandByTransitionThread
  Found that RMActiveServices's StandByTransitionRunnable object has already 
run, so return immediately
   
(The standby RM failed to re-join the election, but it never retries or 
crashes afterwards, so there are no further zk-related logs and the standby RM 
hangs forever.)
{noformat}
So this is a bug in the RM, because the *RM should always try to join the 
election* (giving up on joining should happen only when the RM decides to 
crash). Otherwise, an RM that is outside the election can never become active 
again and do real work.
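
The last two steps of the log can be modelled with a small sketch. This is a simplified illustration, not the actual ResourceManager code: the class {{StandbyTransitionTask}} and its field and method names are hypothetical stand-ins for the StandByTransitionRunnable mentioned in the log, and the only behaviour assumed is that the task refuses to run a second time.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OneShotStandbySketch {

    // Hypothetical stand-in for a standby-transition task that may run only once.
    static class StandbyTransitionTask {
        private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
        private int rejoinAttempts = 0;

        // Returns true if this call actually attempted to rejoin the election.
        boolean run() {
            if (hasAlreadyRun.getAndSet(true)) {
                // Second and later calls return silently: no rejoin, no crash.
                return false;
            }
            rejoinAttempts++; // the single rejoin attempt (which fails in the log)
            return true;
        }

        int rejoinAttempts() {
            return rejoinAttempts;
        }
    }

    public static void main(String[] args) {
        StandbyTransitionTask task = new StandbyTransitionTask();
        // First transition-to-standby event: the task runs and tries to rejoin.
        System.out.println("first event ran: " + task.run());
        // Second event reuses the same task object: nothing happens.
        System.out.println("second event ran: " + task.run());
        System.out.println("rejoin attempts: " + task.rejoinAttempts());
    }
}
```

Under this assumption, once the single rejoin attempt has failed, every later transition-to-standby event is a no-op, which matches the absence of further zk-related logs.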

 

*Caused By:*

It was introduced by YARN-3742.

The intent of that JIRA was that, when a STATE_STORE_OP_FAILED RMFatalEvent 
occurs, the RM should transition to standby instead of crashing.
 *However, the change actually makes ALL kinds of RMFatalEvent ONLY transition 
to standby instead of crashing.* (In contrast, before this change, the RM 
crashed on all of them instead of transitioning to standby.)
 So even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the 
standby RM is left running in a non-working state, for example staying in 
standby forever.

And as the author 
[said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
{quote}I think a good approach here would be to change the RMFatalEvent handler 
to transition to standby as the default reaction, *with shutdown as a special 
case for certain types of failures.*
{quote}
But the author was *too optimistic when implementing the patch.*

 

*The Patch's Solution:*

So, to be *conservative*, we had better *only transition to standby for the 
failures in the {color:#14892c}whitelist{color}:*
 public enum RMFatalEventType {
 *{color:#14892c}// Source <- Store{color}*
 *{color:#14892c}STATE_STORE_FENCED,{color}*
 *{color:#14892c}STATE_STORE_OP_FAILED,{color}*

// Source <- Embedded Elector
 EMBEDDED_ELECTOR_FAILED,

{color:#14892c}*// Source <- Admin Service*{color}
{color:#14892c} *TRANSITION_TO_ACTIVE_FAILED,*{color}

// Source <- Critical Thread Crash
 CRITICAL_THREAD_CRASH
 }
 All others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and any 
failure types added in the future, should crash the RM, because we *cannot 
ensure* that they will never leave the RM unable to work in standby state; the 
*conservative* approach is to crash the RM. Besides, after a crash, the RM 
watchdog can detect it and try to repair the RM machine, send alerts, etc.
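
The whitelist policy described above could be sketched as follows. This is only an illustration under stated assumptions, not the attached patch: the enum values are copied from the listing above, while {{Reaction}}, {{STANDBY_WHITELIST}} and {{reactionFor}} are hypothetical names introduced here.

```java
import java.util.EnumSet;
import java.util.Set;

public class FatalEventPolicySketch {

    // Event types as listed in the issue text above.
    enum RMFatalEventType {
        STATE_STORE_FENCED,
        STATE_STORE_OP_FAILED,
        EMBEDDED_ELECTOR_FAILED,
        TRANSITION_TO_ACTIVE_FAILED,
        CRITICAL_THREAD_CRASH
    }

    // Hypothetical: the two possible reactions to a fatal event.
    enum Reaction { TRANSITION_TO_STANDBY, SHUTDOWN }

    // Only whitelisted failures are considered safe to handle by going to standby.
    static final Set<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
            RMFatalEventType.STATE_STORE_FENCED,
            RMFatalEventType.STATE_STORE_OP_FAILED,
            RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

    // Anything not on the whitelist, including types added later, maps to SHUTDOWN.
    static Reaction reactionFor(RMFatalEventType type) {
        return STANDBY_WHITELIST.contains(type)
                ? Reaction.TRANSITION_TO_STANDBY
                : Reaction.SHUTDOWN;
    }

    public static void main(String[] args) {
        for (RMFatalEventType type : RMFatalEventType.values()) {
            System.out.println(type + " -> " + reactionFor(type));
        }
    }
}
```

Because the check is a whitelist rather than a blacklist, a failure type added to the enum in the future crashes the RM by default until someone explicitly decides that standby is safe for it.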



