[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725199#comment-16725199 ]
Íñigo Goiri commented on YARN-9151:
-----------------------------------

Thanks [~yqwang] for the patch.
I think we want to add a specific test which handles the actual exception (i.e., {{UnknownHostException}}) and catches it. It should be a matter of adding a weird host to the connect string.
Regarding the checkstyle, I'm not sure how it checks indentation for switch/case, but as this is the first place it is used, let's follow the recommendation from Yetus and move all the {{case}} labels to the left.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
> --------------------------------------------------------------------------------------
>
>                  Key: YARN-9151
>                  URL: https://issues.apache.org/jira/browse/YARN-9151
>              Project: Hadoop YARN
>           Issue Type: Bug
>           Components: resourcemanager
>     Affects Versions: 2.9.2
>             Reporter: Yuqi Wang
>             Assignee: Yuqi Wang
>             Priority: Major
>               Labels: patch
>              Fix For: 3.1.1
>
>          Attachments: YARN-9151.001.patch
>
>
> {color:#205081}*Issue Summary:*{color}
> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
>
> {color:#205081}*Issue Repro Steps:*{color}
> # Start multiple RMs in HA mode
> # Modify all hostnames in the zk connect string to different values in DNS.
> (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostnames will change.)
>
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the exception is swallowed and only an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return
>
> {noformat}
> The standby RM failed to rejoin the election, but it will never retry or crash later, *so afterwards there are no zk related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.*
> So, this should be a bug in RM, because *RM should always try to join the election* (giving up on joining the election should only happen when RM decides to crash); otherwise, an RM that is not in the election can never become active again and do real work.
>
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> What that JIRA wants to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing.
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM crashed on all of them instead of transitioning to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it will leave the standby RM not working, e.g. staying in standby forever.
> And as the author said:
> {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>
> {color:#205081}*The Patch's Solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}:*
> public enum RMFatalEventType {
>   {color:#14892c}// Source <- Store{color}
>   {color:#14892c}STATE_STORE_FENCED,{color}
>   {color:#14892c}STATE_STORE_OP_FAILED,{color}
>   // Source <- Embedded Elector
>   EMBEDDED_ELECTOR_FAILED,
>   {color:#14892c}// Source <- Admin Service{color}
>   {color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
>   // Source <- Critical Thread Crash
>   CRITICAL_THREAD_CRASH
> }
> And others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and failure types added in the future (until we triage them into the whitelist), should crash RM, because we *cannot ensure* that they will *never* leave RM unable to work in standby state, and the *conservative* way is to crash RM.
> Besides, after a crash, the RM's external watchdog service can detect it and try to repair the RM machine, send alerts, etc.
> And the RM can reload the latest zk connect string config with the latest hostnames.
> For more details, please check the patch.
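
The whitelist rule described in the issue can be sketched roughly as below. This is only an illustration, not the code in YARN-9151.001.patch: the {{RMFatalEventType}} values are copied from the description and redeclared locally so the example compiles on its own, while the {{Reaction}} enum, the {{reactionFor}} method and the class name are hypothetical. The {{case}} labels are aligned with {{switch}}, as the Yetus checkstyle recommendation mentioned in the comment suggests.

```java
/** Sketch of a whitelist-based reaction to RM fatal events (not the actual patch). */
public class FatalEventPolicySketch {

  /** Event types as listed in the issue description, redeclared to stay self-contained. */
  public enum RMFatalEventType {
    STATE_STORE_FENCED,
    STATE_STORE_OP_FAILED,
    EMBEDDED_ELECTOR_FAILED,
    TRANSITION_TO_ACTIVE_FAILED,
    CRITICAL_THREAD_CRASH
  }

  /** Hypothetical: the two reactions discussed in the issue. */
  public enum Reaction { TRANSITION_TO_STANDBY, SHUTDOWN }

  /**
   * Only whitelisted failures transition to standby. Everything else,
   * including types added later and not yet triaged, falls through to
   * the default branch and shuts the RM down.
   */
  public static Reaction reactionFor(RMFatalEventType type) {
    switch (type) {
    case STATE_STORE_FENCED:
    case STATE_STORE_OP_FAILED:
    case TRANSITION_TO_ACTIVE_FAILED:
      return Reaction.TRANSITION_TO_STANDBY;
    default:
      return Reaction.SHUTDOWN;
    }
  }

  public static void main(String[] args) {
    // Print the reaction chosen for every known event type.
    for (RMFatalEventType type : RMFatalEventType.values()) {
      System.out.println(type + " -> " + reactionFor(type));
    }
  }
}
```

Putting the crash in the {{default}} branch means a newly added event type crashes the RM until someone explicitly adds it to the whitelist, which matches the conservative stance argued for in the description.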