[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuqi Wang updated YARN-9151:
----------------------------
    Fix Version/s:     (was: 3.1.1)

> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-9151
>                 URL: https://issues.apache.org/jira/browse/YARN-9151
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.2
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>              Labels: patch
>             Fix For: 3.1.1
>
>         Attachments: YARN-9151-branch-2.9.2.001.patch, yarn_rm.zip
>
>
> *Issue Summary:*
> Standby RM hangs forever (it neither retries nor crashes) after it permanently
> drops out of the leader election.
>
> *Issue Repro Steps:*
> # Start multiple RMs in HA mode.
> # Change every hostname in the zk connect string to a different value in DNS.
> (In practice this happens when old/bad zk machines are replaced with new/good
> zk machines, so their DNS hostnames change.)
>
> *Issue Logs:*
> The RM is BN4SCH101222318.
> The full RM log is in the attachment, yarn_rm.zip.
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException
> (Here the exception is swallowed and only a transition-to-standby event is sent)
> Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException
> (Here the exception is swallowed and only a transition-to-standby event is sent)
> Send RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run
> previously, so immediately return
>
> (The standby RM failed to rejoin the election, but it never retries or
> crashes afterwards, so there are no further zk-related logs and the standby
> RM hangs forever.)
> {noformat}
> So this is a bug in the RM, because *the RM should always try to join the
> election* (it should give up joining only when it decides to crash).
> Otherwise, an RM that is outside the election can never become active again
> and do real work.
>
> *Caused By:*
> The bug was introduced by YARN-3742.
> That JIRA intended that, when a STATE_STORE_OP_FAILED RMFatalEvent happens,
> the RM should transition to standby instead of crashing.
> *However, in fact, that JIRA makes ALL kinds of RMFatalEvent ONLY transition
> to standby, instead of crash.* (In contrast, before that change, the RM
> crashed on all of them instead of transitioning to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, the
> standby RM is left running but not working, for example staying in standby
> forever.
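The last step of the log sequence above ("has already run previously, so immediately return") can be modelled with a run-once guard. The following is a minimal sketch, not the actual ResourceManager code: the class `RunOnceStandbyDemo`, the nested `StandbyTransition`, and the `rejoinAttempts` counter are hypothetical names used only for illustration, and the sketch assumes the guard is a simple atomic flag that is never reset for the lifetime of the object.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class RunOnceStandbyDemo {

    /** Hypothetical stand-in for a standby-transition runnable with a run-once guard. */
    static class StandbyTransition implements Runnable {
        // Assumption: the guard is an atomic flag that is set once and never cleared.
        private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);

        // Counts how often the body (transition to standby + rejoin election) actually ran.
        int rejoinAttempts = 0;

        @Override
        public void run() {
            // Only the first call gets past this guard.
            if (hasAlreadyRun.getAndSet(true)) {
                // Every later failure ends here: no retry and no crash.
                return;
            }
            // Stands in for "transition to standby and rejoin the election".
            rejoinAttempts++;
        }
    }

    public static void main(String[] args) {
        StandbyTransition transition = new StandbyTransition();
        transition.run(); // first elector failure: handled
        transition.run(); // second elector failure on the same object: dropped
        System.out.println("rejoin attempts: " + transition.rejoinAttempts);
    }
}
```

In this model the second failure is dropped without any retry, which matches the report: if that second rejoin was the one that needed to succeed, nothing else in the process will ever attempt it again.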
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent
> handler to transition to standby as the default reaction, *with shutdown as a
> special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>
> *The Patch's Solution:*
> So, to be *conservative*, we had better *only transition to standby for the
> failures in the {color:#14892c}whitelist{color}:*
> public enum RMFatalEventType {
> {color:#14892c}// Source <- Store{color}
> {color:#14892c}STATE_STORE_FENCED,{color}
> {color:#14892c}STATE_STORE_OP_FAILED,{color}
> // Source <- Embedded Elector
> EMBEDDED_ELECTOR_FAILED,
> {color:#14892c}// Source <- Admin Service{color}
> {color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
> // Source <- Critical Thread Crash
> CRITICAL_THREAD_CRASH
> }
> All other types, such as EMBEDDED_ELECTOR_FAILED, CRITICAL_THREAD_CRASH, and
> any failure types added in the future, should crash the RM, because we
> *cannot ensure* that they will never leave the RM unable to work in standby
> state; the *conservative* choice is to crash the RM. Besides, after a crash,
> the RM watchdog can detect it and try to repair the RM machine, send alerts,
> etc.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
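The whitelist policy proposed in the message can be sketched as a small lookup. This is an illustrative sketch only, not the attached patch: the enum constants are copied from the message, while the class `FatalEventPolicy`, the `Reaction` enum, and the `reactionFor` method are hypothetical names, and the sketch assumes the decision depends on the event type alone.

```java
import java.util.EnumSet;
import java.util.Set;

public class FatalEventPolicy {

    /** Event types as listed in the message. */
    enum RMFatalEventType {
        STATE_STORE_FENCED,
        STATE_STORE_OP_FAILED,
        EMBEDDED_ELECTOR_FAILED,
        TRANSITION_TO_ACTIVE_FAILED,
        CRITICAL_THREAD_CRASH
    }

    /** Hypothetical result type for this sketch. */
    enum Reaction { TRANSITION_TO_STANDBY, CRASH }

    // Whitelist: only these failures are considered safe to handle by going to standby.
    private static final Set<RMFatalEventType> STANDBY_WHITELIST = EnumSet.of(
            RMFatalEventType.STATE_STORE_FENCED,
            RMFatalEventType.STATE_STORE_OP_FAILED,
            RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED);

    /** Anything not on the whitelist, including types added later, maps to CRASH. */
    static Reaction reactionFor(RMFatalEventType type) {
        return STANDBY_WHITELIST.contains(type)
                ? Reaction.TRANSITION_TO_STANDBY
                : Reaction.CRASH;
    }

    public static void main(String[] args) {
        for (RMFatalEventType type : RMFatalEventType.values()) {
            System.out.println(type + " -> " + reactionFor(type));
        }
    }
}
```

Making crash the default branch means a newly added event type fails loudly until someone decides it is safe for standby, which is the conservative behaviour the message argues for.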