[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245691#comment-13245691 ]
Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. Terminology - JournalDaemon or JournalNode. I prefer JournalDaemon because my plan was to run them in the same process space as the namenode. A JournalDaemon could also be a stand-alone process.

I prefer JournalNode, because every other daemon we have is a *Node. If you're running it inside another process, I think we would just call it a "JournalService" -- or an "embedded JournalNode". I think of a daemon as a standalone process.

bq. I like the idea of quorum writes and maintaining the queue. 3092 design currently uses a timeout to declare a JD slow and fail it. We were planning to punt on it until we had a first implementation.

OK. This part is done in the patch attached here and works pretty well so far. If you want, I'm happy to separate out the quorum-completion code and commit it ASAP so we can share code here.

bq. newEpoch() is called fence() in HDFS-3092. My preference is to use the name fence(). I was using a version # which is called epoch. I think the name epoch sounds better. The key difference is that the version # is generated from a znode in HDFS-3092.

As I commented earlier on this ticket, I originally planned to do something similar to you, bootstrapping off of ZK to generate epoch numbers. But when I got into coding, I realized that this algorithm is actually not so hard to implement, and a dependency on ZK adds to the combinatorics of things to think about. I think the "standalone" nature of the approach outweighs whatever benefit we might get by reusing ZK.

bq. So two namenodes cannot use the same epoch number. I think there is a bug with the approach you have described, stemming from the fact that two namenodes can use the same epoch and step 3 in 2.4 can be completed independent of quorum. This is shown in Hari's example.

How can step 3 in section 2.4 be completed independent of quorum?
Step 4 indicates that it requires a quorum of nodes to respond successfully to the {{newEpoch}} message. Here's an example.

Initial state:
||Node||lastPromisedEpoch||
|JN1|1|
|JN2|1|
|JN3|1|

1. Two NNs (NN1 and NN2) enter step 1 concurrently. They both receive responses indicating {{lastPromisedEpoch == 1}} from all of the JNs.
2. They both propose {{newEpoch(2)}}. The behavior of the JN ensures that it will only respond success to either NN1 or NN2, but not both (since it will fail the request if the proposedEpoch <= lastPromisedEpoch).

So, either NN1 or NN2 gets success from a majority. The other node will only get success from a minority, and thus will abort. Note that with message losses or failures, it's possible for _neither_ of the nodes to get a quorum in the case of a race. That's OK, since we expect that an external leader election framework will eventually assist such that only one NN is trying to become active, and then that NN will win.

Note that the epoch algorithm is cribbed from ZAB; see page 7 of Yahoo tech report YL-2010-0007. The mapping from ZAB terminology is:

||ZAB term||QJournal term||
|CEPOCH(e)|response to getLastPromisedEpoch()|
|NEWEPOCH(e')|newEpoch(proposedEpoch)|
|ACK-E(...)|success response to newEpoch()|

> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3077-partial.txt, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.
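The NN1/NN2 race in the comment above can be sketched as a small, single-process simulation. This is only an illustration of the rule "accept {{newEpoch}} iff proposedEpoch > lastPromisedEpoch" plus the majority count; the class names, the in-memory state, and the sequential delivery order standing in for the concurrent race are all assumptions, not the actual HDFS code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a journal node's promise state. Real JNs would persist
// lastPromisedEpoch and be reached over RPC; that is all elided here.
class JournalNodeSketch {
  long lastPromisedEpoch;

  JournalNodeSketch(long epoch) { this.lastPromisedEpoch = epoch; }

  long getLastPromisedEpoch() { return lastPromisedEpoch; }

  // Promise the proposed epoch only if it is strictly newer than any
  // epoch already promised; otherwise reject the request.
  synchronized boolean newEpoch(long proposedEpoch) {
    if (proposedEpoch <= lastPromisedEpoch) {
      return false;
    }
    lastPromisedEpoch = proposedEpoch;
    return true;
  }
}

public class NewEpochRace {
  // Send newEpoch(proposed) to every JN and count the success responses.
  static int sendNewEpoch(List<JournalNodeSketch> jns, long proposed) {
    int acks = 0;
    for (JournalNodeSketch jn : jns) {
      if (jn.newEpoch(proposed)) acks++;
    }
    return acks;
  }

  public static void main(String[] args) {
    List<JournalNodeSketch> jns = new ArrayList<>();
    for (int i = 0; i < 3; i++) jns.add(new JournalNodeSketch(1));

    // Both NNs concurrently observed lastPromisedEpoch == 1 in step 1,
    // so both propose epoch 2 in step 2.
    long proposed = 2;

    int nn1Acks = sendNewEpoch(jns, proposed); // NN1's messages arrive first
    int nn2Acks = sendNewEpoch(jns, proposed); // every JN has now promised 2

    int quorum = jns.size() / 2 + 1;
    System.out.println("NN1: acks=" + nn1Acks + " wins=" + (nn1Acks >= quorum));
    System.out.println("NN2: acks=" + nn2Acks + " wins=" + (nn2Acks >= quorum));
  }
}
```

With interleaved (rather than back-to-back) delivery, NN1 and NN2 can each collect a minority of acks and both abort, which is the message-loss/race case noted above that the external leader election is expected to resolve.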