[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242855#comment-13242855 ]
Todd Lipcon commented on HADOOP-8217:
-------------------------------------

bq. 3. ZKFC2 tries to do transitionToStandby() on NN1. RPC times out.
bq. 4. Don't know what happens now in your design

As has been the case in all of the HA work up to and including this point, it initiates the fence method at this point. The fence method has to do persistent fencing of the shared resource (eg. disable access to the SAN or STONITH the node). Please refer to the code in which I think this is fairly clear.

The solution here is to improve the ability to do failover when "graceful fencing" suffices. In many failover cases it's preferable to _not_ have to invoke STONITH or storage fencing, since those mechanisms will often require administrative intervention to un-fence.

bq. Given, the above, how will NN1 receive the zxid from ZKFC2? If it does not then the solution is invalid. Hari's scenario exemplifies this.

All transitionToActive/transitionToStandby calls would include the zxid. So, the sequence becomes:

1. ZKFC1 gets active lock (zxid=1)
2. ZKFC1 is about to send transitionToActive(1) and machine freezes (eg GC pause + swapping)
3. ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock (zxid=2)
4. ZKFC2 calls NN1.transitionToStandby(2) and NN2.transitionToActive(2).
5. ZKFC1 wakes up from pause, calls NN1.transitionToActive(1). NN1 rejects the request because it previously accepted zxid=2 in step 4 above.

or the failure case:

4(failure case): if NN1.transitionToStandby() times out or fails, the non-graceful fencing is initiated (same as in existing HA code for the last several months)
5(failure case with storage fencing): ZKFC1 wakes up from pause, and calls NN1.transitionToActive(1). NN1 tries to access the shared edits storage and fails, because it has been fenced. So, there is no split-brain.
5(failure case with STONITH): ZKFC1 never wakes up from pause, because its power plug has been pulled. So, there is no split-brain.
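
The zxid check in the graceful sequence above can be sketched roughly as follows. This is only an illustration of the idea, not the actual patch: the class `NameNodeSketch`, its method names, the `StaleZxidError` exception, and the choice to accept a repeated zxid (so one ZKFC can issue several calls under the same lock) are all assumptions made for the example.

```python
class StaleZxidError(Exception):
    """Raised when a request carries a zxid older than one already accepted."""


class NameNodeSketch:
    """Hypothetical stand-in for the NN side of the zxid check."""

    def __init__(self, name):
        self.name = name
        self.state = "standby"
        self.last_zxid = 0  # highest zxid accepted so far

    def _check_zxid(self, zxid):
        # Reject any request from a ZKFC holding an older lock generation.
        # Assumption: an equal zxid is allowed, only strictly older ones fail.
        if zxid < self.last_zxid:
            raise StaleZxidError(
                f"{self.name}: zxid {zxid} is older than accepted {self.last_zxid}"
            )
        self.last_zxid = zxid

    def transition_to_active(self, zxid):
        self._check_zxid(zxid)
        self.state = "active"

    def transition_to_standby(self, zxid):
        self._check_zxid(zxid)
        self.state = "standby"


# Replay steps 4 and 5 of the graceful sequence.
nn1 = NameNodeSketch("NN1")
nn2 = NameNodeSketch("NN2")

# Step 4: ZKFC2 (zxid=2) demotes NN1 and promotes NN2.
nn1.transition_to_standby(2)
nn2.transition_to_active(2)

# Step 5: ZKFC1 wakes up and sends its delayed request with zxid=1.
rejected = False
try:
    nn1.transition_to_active(1)
except StaleZxidError:
    rejected = True
```

In this sketch the late call from ZKFC1 is refused and NN1 stays in standby. The sketch covers only the graceful path; if the call in step 4 never reaches NN1, the zxid is never recorded, which is why the non-graceful fence method is still needed.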
> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8217-testcase.txt
>
>
> As discussed in HADOOP-8206, the current design for automatic failover has
> the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC
> pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad
> situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session
> timeout, but worth fixing, since the results can be disastrous.