[ https://issues.apache.org/jira/browse/YARN-9151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724903#comment-16724903 ]
Yuqi Wang edited comment on YARN-9151 at 12/19/18 11:28 AM:
------------------------------------------------------------

BTW, [~jianhe], for YARN-4438, you said:
{quote}_If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), *this RM will never become active; don't we want to just die?*_

What do you mean by force give-up ? exit RM ? The underlying curator implementation *will retry the connection in background*, even though the exception is thrown. See *Guaranteeable* interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ?
{quote}
However, for this case, if we are using CuratorBasedElectorService, I think Curator will *NOT* retry the connection, because I saw the following in the log and checked Curator's code:

*Background exception was not retry-able or retry gave up for UnknownHostException*
{code:java}
2018-12-14 14:14:20,847 ERROR [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
java.net.UnknownHostException: BN2AAP10C07C229
	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:461)
	at org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:146)
	at org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
	at org.apache.curator.HandleHolder.getZooKeeper(HandleHolder.java:55)
	at org.apache.curator.ConnectionState.reset(ConnectionState.java:218)
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:193)
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
Besides, in YARN-4438, I did not see the Curator *Guaranteeable* interface being used. So, in the patch, if rejoining the election throws an exception, it sends EMBEDDED_ELECTOR_FAILED, and then RM will crash.

> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-9151
>                 URL: https://issues.apache.org/jira/browse/YARN-9151
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.2
>            Reporter: Yuqi Wang
>            Assignee: Yuqi Wang
>            Priority: Major
>              Labels: patch
>             Fix For: 3.1.1
>         Attachments: YARN-9151.001.patch, yarn_rm.zip
>
>
> {color:#205081}*Issue Summary:*{color}
> Standby RM hangs (not retry or crash) forever due to forever lost from leader election
>
> {color:#205081}*Issue Repro Steps:*{color}
> # Start multiple RMs in HA mode
> # Modify all hostnames in the zk connect string to different values in DNS. (In reality, we need to replace old/bad zk machines with new/good zk machines, so their DNS hostnames will be changed.)
>
> {color:#205081}*Issue Logs:*{color}
> See the full RM log in attachment, yarn_rm.zip (The RM is BN4SCH101222318).
> To make it clear, the whole story is:
> {noformat}
> Join Election
> Win the leader (ZK Node Creation Callback)
> Start to becomeActive
> Start RMActiveServices
> Start CommonNodeLabelsManager failed due to zk connect UnknownHostException
> Stop CommonNodeLabelsManager
> Stop RMActiveServices
> Create and Init RMActiveServices
> Fail to becomeActive
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Already in standby state
> ReJoin Election
> Failed to Join Election due to zk connect UnknownHostException (Here the exception is eaten and just an event is sent)
> Send EMBEDDED_ELECTOR_FAILED RMFatalEvent to transition RM to standby
> Transitioning RM to Standby
> Start StandByTransitionThread
> Found RMActiveServices's StandByTransitionRunnable object has already run previously, so immediately return
>
> (The standby RM failed to rejoin the election, but it never retries or crashes afterwards, so there are no further zk related logs and the standby RM hangs forever, even if the zk connect string hostnames are changed back to the original ones in DNS.)
> {noformat}
> So, this should be a bug in RM, because *RM should always try to join the election* (giving up joining the election should only happen when RM decides to crash); otherwise, an RM outside the election can never become active again and start real work.
>
> {color:#205081}*Caused By:*{color}
> It is introduced by YARN-3742
> What the JIRA wanted to improve is that, when a STATE_STORE_OP_FAILED RMFatalEvent happens, RM should transition to standby instead of crashing.
> *However, in fact, the JIRA makes ALL kinds of RMFatalEvent ONLY transition to standby, instead of crash.* (In contrast, before this change, RM crashed on all of them instead of going to standby.)
> So, even if EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH happens, it leaves the standby RM not working, such as staying in standby forever.
> And as the author [said|https://issues.apache.org/jira/browse/YARN-3742?focusedCommentId=15891385&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15891385]:
> {quote}I think a good approach here would be to change the RMFatalEvent handler to transition to standby as the default reaction, *with shutdown as a special case for certain types of failures.*
> {quote}
> But the author was *too optimistic when implementing the patch.*
>
> {color:#205081}*The Patch's Solution:*{color}
> So, to be *conservative*, we had better *only transition to standby for the failures in the {color:#14892c}whitelist{color}:*
> public enum RMFatalEventType {
>   {color:#14892c}// Source <- Store{color}
>   {color:#14892c}STATE_STORE_FENCED,{color}
>   {color:#14892c}STATE_STORE_OP_FAILED,{color}
>   // Source <- Embedded Elector
>   EMBEDDED_ELECTOR_FAILED,
>   {color:#14892c}// Source <- Admin Service{color}
>   {color:#14892c}TRANSITION_TO_ACTIVE_FAILED,{color}
>   // Source <- Critical Thread Crash
>   CRITICAL_THREAD_CRASH
> }
> And the others, such as EMBEDDED_ELECTOR_FAILED or CRITICAL_THREAD_CRASH and failure types added in the future, should crash RM, because we *cannot ensure* that they will *never* leave the RM unable to work in standby state, and the *conservative* way is to crash RM. Besides, after a crash, the RM's external watchdog service can notice this and try to repair the RM machine, send alerts, etc.
> For more details, please check the patch.
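To illustrate the "Background exception was not retry-able" behavior discussed in the comment above, here is a simplified, dependency-free model of the check Curator applies before retrying a failed background operation. This is a sketch of my reading of that logic, not the actual Curator source: the nested `ModelKeeperException`/`ZkCode` types are hypothetical stand-ins for ZooKeeper's `KeeperException` and its error codes, kept local so the example compiles on its own. The point it demonstrates is that only ZooKeeper-level recoverable codes are retried, so a JVM-level `UnknownHostException` falls through and the operation is given up.

```java
import java.net.UnknownHostException;

public class RetryableCheck {

    // Subset of ZooKeeper error codes; the first four mirror the codes
    // Curator treats as recoverable (simplified for this sketch).
    enum ZkCode { CONNECTIONLOSS, OPERATIONTIMEOUT, SESSIONMOVED, SESSIONEXPIRED, NONODE }

    // Hypothetical stand-in for org.apache.zookeeper.KeeperException,
    // so the example needs no external jars.
    static class ModelKeeperException extends Exception {
        final ZkCode code;
        ModelKeeperException(ZkCode code) { this.code = code; }
    }

    // Modeled on Curator's retry-ability check: anything that is not a
    // KeeperException carrying a recoverable code is NOT retried.
    static boolean isRetryException(Throwable t) {
        if (!(t instanceof ModelKeeperException)) {
            // e.g. UnknownHostException: give up and log
            // "Background exception was not retry-able or retry gave up"
            return false;
        }
        switch (((ModelKeeperException) t).code) {
            case CONNECTIONLOSS:
            case OPERATIONTIMEOUT:
            case SESSIONMOVED:
            case SESSIONEXPIRED:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // A DNS failure is not a ZK protocol error, so it is not retried.
        System.out.println(isRetryException(new UnknownHostException("BN2AAP10C07C229")));
        // A connection loss at the ZK protocol level is retried.
        System.out.println(isRetryException(new ModelKeeperException(ZkCode.CONNECTIONLOSS)));
    }
}
```

Under this model, the `UnknownHostException` in the log above never reaches the retry path, which matches the observed "retry gave up" ERROR followed by silence.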
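The whitelist reaction described in the issue above can be sketched as follows. This is one possible reading of the proposed approach, not the committed patch: the `FatalEventPolicy` class and `transitionToStandby` method are hypothetical names for illustration, and only the enum constants come from the issue text.

```java
public class FatalEventPolicy {

    // Mirrors the RMFatalEventType constants listed in the issue description.
    enum RMFatalEventType {
        STATE_STORE_FENCED,          // Source <- Store
        STATE_STORE_OP_FAILED,       // Source <- Store
        EMBEDDED_ELECTOR_FAILED,     // Source <- Embedded Elector
        TRANSITION_TO_ACTIVE_FAILED, // Source <- Admin Service
        CRITICAL_THREAD_CRASH        // Source <- Critical Thread Crash
    }

    // true  -> transition to standby (whitelisted: standby is known to be safe)
    // false -> crash the RM (conservative default, also covers event types
    //          added in the future) so an external watchdog can repair/alert.
    static boolean transitionToStandby(RMFatalEventType type) {
        switch (type) {
            case STATE_STORE_FENCED:
            case STATE_STORE_OP_FAILED:
            case TRANSITION_TO_ACTIVE_FAILED:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        for (RMFatalEventType t : RMFatalEventType.values()) {
            System.out.println(t + " -> " + (transitionToStandby(t) ? "standby" : "crash"));
        }
    }
}
```

The design choice here is that the default branch crashes: any failure type not explicitly proven safe for standby (including EMBEDDED_ELECTOR_FAILED, which caused the hang in this issue) takes the conservative path.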
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org