[ 
https://issues.apache.org/jira/browse/HADOOP-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240960#comment-13240960
 ] 

Suresh Srinivas commented on HADOOP-8217:
-----------------------------------------

bq. we've already had a meeting ostensibly for this purpose, I think. 
The way I understood the meeting we had was more about the next steps and not 
design details.

bq. with the original manual-failover project...
HDFS-1623 was not a manual-failover project. It did talk about automatic 
failover. It is just that we decided to merge the branch post manual failover. 
Anyway, that is orthogonal.

While HDFS-1623 did give high-level direction, some of the design could have 
been hashed out in more detail. It would have helped people follow what is 
happening, instead of having to piece together the design through numerous 
jiras. Anyway, that is my opinion. I also heard concerns from folks following 
that branch that the development looked chaotic...

bq. So, I am not going to pause work to wait for meetings or more design 
discussion.
Well, it is up to you. A complex design such as FailoverController could benefit 
more from a meeting of folks than from doing it in comments over jiras. At least, 
some of our own internal discussions on this (for example, the ZK library we did 
and other design we are doing) greatly benefited from real-time discussions.

bq. Since there seems to be concern that we are moving too fast, I will create 
an auto-failover branch later tonight to continue working on implementing this 
design.
Thanks for doing that.

bq. HDFS-2185...
I will review the design and post comments.
                
> Edge case split-brain race in ZK-based auto-failover
> ----------------------------------------------------
>
>                 Key: HADOOP-8217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> As discussed in HADOOP-8206, the current design for automatic failover has 
> the following race:
> - ZKFC1 gets active lock
> - ZKFC1 is about to send transitionToActive() and machine freezes (e.g. GC 
> pause + swapping)
> - ZKFC1 loses its ZK lock, ZKFC2 gets ZK lock
> - ZKFC2 calls transitionToStandby on NN1, and transitions NN2 to active
> - ZKFC1 wakes up from pause, calls transitionToActive(), now we have a bad 
> situation
> This is rare, since it requires ZKFC1 to freeze longer than its ZK session 
> timeout, but worth fixing, since the results can be disastrous.
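
As a rough illustration of the interleaving listed above, the following Java sketch replays the five steps in a fixed order and reports which NameNodes end up active. It is only a toy model: the NameNode class and the replayRace() method are hypothetical stand-ins invented for this example, not Hadoop or ZooKeeper APIs, and the ZK lock handover is represented only by comments.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the race; every name here is hypothetical, not a Hadoop API.
class SplitBrainRaceSketch {
    enum State { STANDBY, ACTIVE }

    /** Hypothetical stand-in for a NameNode that only tracks its HA state. */
    static final class NameNode {
        final String name;
        State state = State.STANDBY;

        NameNode(String name) { this.name = name; }

        void transitionToActive() { state = State.ACTIVE; }

        void transitionToStandby() { state = State.STANDBY; }
    }

    /** Replays the steps from the issue description; returns the NameNodes left active. */
    static List<String> replayRace() {
        NameNode nn1 = new NameNode("NN1");
        NameNode nn2 = new NameNode("NN2");

        // Step 1: ZKFC1 gets the active lock.
        // Step 2: ZKFC1 freezes just before sending transitionToActive() to NN1.
        // Step 3: ZKFC1's ZK session times out; ZKFC2 gets the lock.

        // Step 4: ZKFC2 fences NN1 and makes NN2 active.
        nn1.transitionToStandby();
        nn2.transitionToActive();

        // Step 5: ZKFC1 wakes up and sends its stale call without
        // rechecking whether it still holds the lock.
        nn1.transitionToActive();

        List<String> active = new ArrayList<>();
        for (NameNode nn : new NameNode[] { nn1, nn2 }) {
            if (nn.state == State.ACTIVE) {
                active.add(nn.name);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("Active after race: " + replayRace());
    }
}
```

In this model both NameNodes are reported active at the end, which is the split-brain outcome the description warns about. The sketch says nothing about how the issue is or should be fixed.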

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
