[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242944#comment-13242944 ]
Todd Lipcon commented on HADOOP-8217:
-------------------------------------

bq. I would like to question the value of FC2 calling NN1.transitionToStandby() in general. FC1 on NN1 is supposed to call NN1.transitionToStandby() because that is FC1's responsibility upon losing the leader lock.

This doesn't work, since FC1 can take arbitrarily long to notice that it has lost its lock.

bq. Secondly, based on the recent work done to add breadcrumbs to the ActiveStandbyElector, FC2 is going to fence NN1 if NN1 has not gracefully given up the lock, which is clearly the case here. So the problem is already solved unless I am mistaken.

But the first stage of "fencing" is to gracefully ask the NN to go to standby. This is exactly the problem here. If, instead, we required that we always use an aggressive fencing mechanism (STONITH/NAS fencing), you're right that there would not be a problem. But we can avoid that in many cases -- for example, imagine that the active node loses its connection to the ZK quorum, but still has a connection to the other NN (e.g., by a crossover cable). In this case it will leave its breadcrumb znode there, but the new active can easily transition it to standby.

Here's another way of looking at this JIRA:
- the "aggressive" fencing mechanisms have the property of being "persistent", i.e., after fencing, the node cannot become active, even if asked to.
- the "graceful" fencing mechanism (transitionToStandby() RPC) does not currently have the property of being "persistent". If another, older node asks it to become active after it has been "gracefully fenced", it will incorrectly do so.
- This JIRA makes "graceful fencing" persistent, so it can be used correctly.

Regarding the ActiveStandbyElector callback for {{becomeStandby}}, I actually think it's redundant. There are two cases in which it could be called:
- If already standby, it's a no-op
- If active, then this indicates that the elector lost its znode.
Since it lost its znode (rather than quitting the election gracefully), it will leave its breadcrumb behind. Thus, the other node will fence it. So, calling transitionToStandby is redundant with fencing which the other node will have to perform anyway.

> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8217-testcase.txt
>
>
> As discussed in HADOOP-8206, the current design for automatic failover has the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session timeout, but worth fixing, since the results can be disastrous.
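
To make the race in the issue description concrete, here is a minimal toy sketch of one way "graceful fencing" could be made persistent. It is not Hadoop code and not the attached patch. It assumes each failover controller carries a number (called an "epoch" here) that increases every time the ZK lock changes hands, and that the NN remembers the highest epoch that has fenced it. The class, method, and parameter names are all hypothetical.

```java
// Toy model only: not Hadoop code and not the HADOOP-8217 patch.
// All names (ToyNameNode, callerEpoch, ...) are hypothetical.
class PersistentFencingSketch {

    enum State { STANDBY, ACTIVE }

    /** Stand-in for a NameNode that remembers the newest epoch it has seen. */
    static class ToyNameNode {
        private State state = State.STANDBY;
        private long highestEpochSeen = 0;

        /** Graceful fence: go to standby and remember the caller's epoch. */
        synchronized void transitionToStandby(long callerEpoch) {
            highestEpochSeen = Math.max(highestEpochSeen, callerEpoch);
            state = State.STANDBY;
        }

        /** Returns false, and changes nothing, if a newer controller already fenced this node. */
        synchronized boolean transitionToActive(long callerEpoch) {
            if (callerEpoch < highestEpochSeen) {
                return false; // stale request from a controller that lost the lock
            }
            highestEpochSeen = callerEpoch;
            state = State.ACTIVE;
            return true;
        }

        synchronized State getState() {
            return state;
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn1 = new ToyNameNode();
        ToyNameNode nn2 = new ToyNameNode();

        long zkfc1Epoch = 1; // ZKFC1 gets the lock, then freezes before sending its RPC
        long zkfc2Epoch = 2; // ZKFC1's session expires and ZKFC2 gets the lock

        nn1.transitionToStandby(zkfc2Epoch); // ZKFC2 gracefully fences NN1
        nn2.transitionToActive(zkfc2Epoch);  // ZKFC2 makes NN2 active

        // ZKFC1 wakes up and sends its delayed transitionToActive()
        boolean accepted = nn1.transitionToActive(zkfc1Epoch);
        System.out.println("late request accepted: " + accepted);
        System.out.println("NN1=" + nn1.getState() + " NN2=" + nn2.getState());
    }
}
```

In this sketch the delayed request from ZKFC1 is refused because NN1 has already been fenced at a higher epoch, so only NN2 ends up active. Where such an epoch would come from in a real deployment is left open here.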