[ https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240960#comment-13240960 ]
Suresh Srinivas commented on HADOOP-8217:
-----------------------------------------

bq. we've already had a meeting ostensibly for this purpose, I think.
As I understood it, the meeting we had was more about the next steps and not design details.

bq. with the original manual-failover project...
HDFS-1623 was not a manual-failover project. It did talk about automatic failover. It is just that we decided to merge the branch after manual failover was done. Anyway, that is orthogonal.

While HDFS-1623 did give high-level direction, some of the design could have been hashed out in more detail. It would have helped people follow what is happening, instead of having to piece together the design through numerous jiras. Anyway, that is my opinion. I also heard concerns from folks following that branch that the development looked chaotic...

bq. So, I am not going to pause work to wait for meetings or more design discussion.
Well, it is up to you. A complex design such as FailoverController could benefit more from a meeting of folks than from working it out in comments over jiras. At least, some of our own internal discussion on this (for example, the ZK library we did and other design we are doing) greatly benefited from real-time discussions.

bq. Since there seems to be concern that we are moving too fast, I will create an auto-failover branch later tonight to continue working on implementing this design.
Thanks for doing that.

bq. HDFS-2185...
Will review the design and post the comments.

> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> As discussed in HADOOP-8206, the current design for automatic failover has
> the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (eg GC
> pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad
> situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session
> timeout, but worth fixing, since the results can be disastrous.
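
The interleaving in the quoted issue description can be replayed with a small, deterministic toy model. This is only a sketch: `NameNode`, `ToyLock`, and `replay_race` are hypothetical names invented for illustration, not Hadoop or ZooKeeper APIs, and the steps run sequentially in one process rather than on real threads or real ZK sessions.

```python
class NameNode:
    """Toy stand-in for a NameNode; it only tracks its HA state."""

    def __init__(self, name):
        self.name = name
        self.state = "standby"

    def transition_to_active(self):
        self.state = "active"

    def transition_to_standby(self):
        self.state = "standby"


class ToyLock:
    """Toy stand-in for the ZK active lock: at most one holder at a time."""

    def __init__(self):
        self.holder = None

    def acquire(self, who):
        # Succeeds only when nobody currently holds the lock.
        if self.holder is None:
            self.holder = who
            return True
        return False

    def expire(self, who):
        # Models the holder's ZK session timing out while it is frozen.
        if self.holder == who:
            self.holder = None


def replay_race():
    """Replay the five steps from the issue description in order."""
    nn1, nn2 = NameNode("NN1"), NameNode("NN2")
    lock = ToyLock()

    # 1. ZKFC1 gets the active lock.
    zkfc1_won = lock.acquire("ZKFC1")

    # 2. ZKFC1 is about to call transitionToActive() and freezes.
    #    Its decision is already made; only the call is delayed.

    # 3. ZKFC1 loses its ZK lock; ZKFC2 gets it.
    lock.expire("ZKFC1")
    lock.acquire("ZKFC2")

    # 4. ZKFC2 moves NN1 to standby and NN2 to active.
    nn1.transition_to_standby()
    nn2.transition_to_active()

    # 5. ZKFC1 wakes up and acts on its stale belief that it holds the lock.
    if zkfc1_won:
        nn1.transition_to_active()

    return lock.holder, nn1.state, nn2.state


if __name__ == "__main__":
    print(replay_race())
```

In this model both NameNodes end up active even though only ZKFC2 holds the lock, which is the split-brain outcome the issue describes. The key point is that ZKFC1 never re-checks lock ownership between waking up and making its call.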