[ 
https://issues.apache.org/jira/browse/HDFS-14961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978287#comment-16978287
 ] 

Fei Hui commented on HDFS-14961:
--------------------------------

{quote}
In this case, when the Namenode Joined election it was a Standby Namenode only. 
{quote}
I debugged this and added some logging. When it joined the election, it was an
observer namenode.
The MonitorDaemon thread call chain is:
doHealthChecks -> enterState(State.SERVICE_HEALTHY) -> recheckElectability() -> 
elector.joinElection(targetToData(localTarget)) -> joinElectionInternal -> 
createLockNodeAsync

The ZooKeeper callback is:
processResult -> becomeStandby

The UT you mentioned always reproduces this issue if the sleep time is greater
than 10s:
{code}
    int result = tool.run(
        new String[]{"-transitionToObserver", "-forcemanual", "nn2"});
    assertEquals("State transition returned: " + result, 0, result);
    Thread.sleep(10000);
    waitForHAState(1, HAServiceState.OBSERVER);
{code}

ha.failover-controller.graceful-fence.rpc-timeout.ms is 5000 ms by default, and
the effective timeout is 10 s because the following code doubles it.
{code}
  private void doGracefulFailover()
      throws ServiceFailedException, IOException, InterruptedException {
    int timeout = FailoverController.getGracefulFenceTimeout(conf) * 2;
{code}
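
As a rough check of the arithmetic above, the sketch below doubles the default RPC timeout the same way doGracefulFailover() does. It is a stand-alone model, not Hadoop code: the class and method names are hypothetical, and a plain Map stands in for Hadoop's Configuration. Only the key name and the 5000 ms default come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone model of the timeout arithmetic; not Hadoop code.
class GracefulFenceTimeoutSketch {
    static final String KEY =
        "ha.failover-controller.graceful-fence.rpc-timeout.ms";
    static final int DEFAULT_MS = 5000; // default quoted in the comment above

    // Reads the configured RPC timeout, falling back to the default.
    static int gracefulFenceTimeoutMs(Map<String, String> conf) {
        String value = conf.get(KEY);
        return value == null ? DEFAULT_MS : Integer.parseInt(value);
    }

    // Mirrors "getGracefulFenceTimeout(conf) * 2" from doGracefulFailover().
    static int failoverWaitMs(Map<String, String> conf) {
        return gracefulFenceTimeoutMs(conf) * 2;
    }

    public static void main(String[] args) {
        // With an empty configuration this prints 10000.
        System.out.println(failoverWaitMs(new HashMap<>()));
    }
}
```

With the default value this gives 10000 ms, which matches the 10 s window after which the UT starts to fail.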

The test has already failed over from nn2 to nn1, so nn2 does not participate
in the election until the 10 s timeout expires. After that, HealthMonitor can
reach elector.joinElection(targetToData(localTarget)) and change the observer
namenode to standby.
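
The sequence can be illustrated with a toy model. This is only a sketch under stated assumptions: every name in it is hypothetical, it does not use any Hadoop or ZooKeeper class, and the skipObserver guard is just one conceivable way to avoid the problem, not necessarily what the attached patches do.

```java
// Hypothetical toy model of the sequence described above; not Hadoop code.
class ObserverElectionSketch {
    enum State { ACTIVE, STANDBY, OBSERVER }

    // Assumed window during which the node stays out of the election.
    static final long REJOIN_DELAY_MS = 10_000;

    State state = State.ACTIVE;
    long rejoinAllowedAtMs = 0;

    // Manual failover away from this node: it becomes standby and
    // does not rejoin the election until the delay has passed.
    void cedeActive(long nowMs) {
        state = State.STANDBY;
        rejoinAllowedAtMs = nowMs + REJOIN_DELAY_MS;
    }

    // Models "-transitionToObserver -forcemanual".
    void transitionToObserver() {
        state = State.OBSERVER;
    }

    // Models doHealthChecks -> recheckElectability -> joinElection ->
    // processResult -> becomeStandby, assuming the other node holds the lock.
    void healthCheck(long nowMs, boolean skipObserver) {
        if (nowMs < rejoinAllowedAtMs) {
            return; // still inside the window, election not joined
        }
        if (skipObserver && state == State.OBSERVER) {
            return; // hypothetical guard: observers never join the election
        }
        state = State.STANDBY; // joined, lost the election, became standby
    }
}
```

Without the guard, a health check at 5 s leaves the node as observer, while one at 10 s or later flips it to standby. That matches the UT passing without the sleep and failing with it.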

> Prevent ZKFC changing Observer Namenode state
> ---------------------------------------------
>
>                 Key: HDFS-14961
>                 URL: https://issues.apache.org/jira/browse/HDFS-14961
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Íñigo Goiri
>            Assignee: Ayush Saxena
>            Priority: Major
>         Attachments: HDFS-14961-01.patch, HDFS-14961-02.patch
>
>
> HDFS-14130 made ZKFC aware of the Observer Namenode and hence allows ZKFC 
> to run alongside the Observer node.
> The Observer Namenode isn't supposed to be part of the ZKFC election process.
> But if the Namenode was part of the election before being turned into an 
> Observer by the transitionToObserver command, the ZKFC still sends 
> instructions to the Namenode as a result of its previous participation and 
> sometimes tends to change the state of the Observer to Standby.
> This is also the reason for the failure in TestDFSZKFailoverController.
> TestDFSZKFailoverController has been consistently failing with a timeout 
> while waiting in testManualFailoverWithDFSHAAdmin(), in particular at 
> {{waitForHAState(1, HAServiceState.OBSERVER);}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

