[ https://issues.apache.org/jira/browse/HDFS-14961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978287#comment-16978287 ]
Fei Hui commented on HDFS-14961:
--------------------------------

{quote}
In this case, when the Namenode Joined election it was a Standby Namenode only.
{quote}
I debugged this and printed some logs. When it joined the election, it was an Observer NameNode.

The MonitorDaemon thread call chain is:
doHealthChecks -> enterState(State.SERVICE_HEALTHY) -> recheckElectability() -> elector.joinElection(targetToData(localTarget)) -> joinElectionInternal -> createLockNodeAsync

The ZooKeeper callback is:
processResult -> becomeStandby

The UT you mentioned always reproduces this issue if the sleep time is greater than 10s:
{code}
    int result = tool.run(
        new String[]{"-transitionToObserver", "-forcemanual", "nn2"});
    assertEquals("State transition returned: " + result, 0, result);
    Thread.sleep(10000);
    waitForHAState(1, HAServiceState.OBSERVER);
{code}
ha.failover-controller.graceful-fence.rpc-timeout.ms is 5000ms by default, and the timeout is 10s because of the following code:
{code}
  private void doGracefulFailover()
      throws ServiceFailedException, IOException, InterruptedException {
    int timeout = FailoverController.getGracefulFenceTimeout(conf) * 2;
{code}
The test for failover from nn2 to nn1 has already completed, and nn2 will not participate in the election until the 10s timeout expires. After that, HealthMonitor can reach the call
elector.joinElection(targetToData(localTarget))
and change the Observer NameNode to Standby.

> Prevent ZKFC changing Observer Namenode state
> ---------------------------------------------
>
>                 Key: HDFS-14961
>                 URL: https://issues.apache.org/jira/browse/HDFS-14961
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Íñigo Goiri
>            Assignee: Ayush Saxena
>            Priority: Major
>         Attachments: HDFS-14961-01.patch, HDFS-14961-02.patch
>
>
> HDFS-14130 made ZKFC aware of the Observer Namenode and hence allows ZKFC
> to run alongside the Observer node.
> The Observer Namenode isn't supposed to be part of the ZKFC election process.
> But if the Namenode was part of the election before turning into an Observer
> via the transitionToObserver command, the ZKFC still sends instructions to the
> Namenode as a result of its previous participation and sometimes tends to
> change the state of the Observer to Standby.
> This is also the reason for the failure in TestDFSZKFailoverController.
> TestDFSZKFailoverController has been consistently failing with a timeout
> waiting in testManualFailoverWithDFSHAAdmin(), in particular at
> {{waitForHAState(1, HAServiceState.OBSERVER);}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)