[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245691#comment-13245691 ]
Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. Terminology - JournalDaemon or JournalNode. I prefer JournalDaemon because my plan was to run them in the same process space as the namenode. A JournalDaemon could also be a stand-alone process.

I prefer JournalNode, because every other daemon we have is a *Node. If you're running it inside another process, I think we would just call it a "JournalService" -- or an "embedded JournalNode". I think of a daemon as a standalone process.

bq. I like the idea of quorum writes and maintaining the queue. 3092 design currently uses a timeout to declare a JD slow and fail it. We were planning to punt on it until we had a first implementation.

OK. This part is done in the patch attached here and works pretty well so far. If you want, I'm happy to separate out the quorum-completion code and commit it ASAP so we can share code here.

bq. newEpoch() is called fence() in HDFS-3092. My preference is to use the name fence(). I was using a version # which is called epoch. I think the name epoch sounds better. The key difference is that the version # is generated from a znode in HDFS-3092.

As I commented earlier on this ticket, I originally planned to do something similar to you, bootstrapping off of ZK to generate epoch numbers. But when I got into coding, I realized that this algorithm is actually not so hard to implement, and a dependency on ZK adds to the combinatorics of things to think about. I think the "standalone" nature of the approach outweighs whatever benefit we might get by reusing ZK.

bq. So two namenodes cannot use the same epoch number. I think there is a bug with the approach you have described, stemming from the fact that two namenodes can use the same epoch and step 3 in 2.4 can be completed independent of quorum. This is shown in Hari's example.

How can step 3 in section 2.4 be completed independent of quorum?
Step 4 indicates that it requires a quorum of nodes to respond successfully to the {{newEpoch}} message. Here's an example.

Initial state:
||Node||lastPromisedEpoch||
|JN1|1|
|JN2|1|
|JN3|1|

1. Two NNs (NN1 and NN2) enter step 1 concurrently. They both receive responses indicating {{lastPromisedEpoch == 1}} from all of the JNs.
2. They both propose {{newEpoch(2)}}. The behavior of the JN ensures that it will only respond success to either NN1 or NN2, but not both (since it will fail the request if the proposedEpoch <= lastPromisedEpoch).

So, either NN1 or NN2 gets success from a majority. The other node will only get success from a minority, and thus will abort. Note that with message losses or failures, it's possible for _neither_ of the nodes to get a quorum in the case of a race. That's OK, since we expect that an external leader election framework will eventually assist such that only one NN is trying to become active, and then that NN will win.

Note that the epoch algorithm is cribbed from ZAB; see page 7 of Yahoo tech report YL-2010-0007. The mapping from ZAB terminology is:

||ZAB term||QJournal term||
|CEPOCH(e)|response to getLastPromisedEpoch()|
|NEWEPOCH(e')|newEpoch(proposedEpoch)|
|ACK-E(...)|success response to newEpoch()|

> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-3077-partial.txt, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.
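The NN1/NN2 race in the comment above can be sketched as a small, single-process simulation. This is only an illustration of the rule "accept {{newEpoch}} iff proposedEpoch > lastPromisedEpoch" plus the majority count; the class names, the in-memory state, and the sequential delivery order standing in for the concurrent race are all assumptions, not the actual HDFS code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a journal node's promise state. Real JNs would persist
// lastPromisedEpoch and be reached over RPC; that is all elided here.
class JournalNodeSketch {
  long lastPromisedEpoch;

  JournalNodeSketch(long epoch) { this.lastPromisedEpoch = epoch; }

  long getLastPromisedEpoch() { return lastPromisedEpoch; }

  // Promise the proposed epoch only if it is strictly newer than any
  // epoch already promised; otherwise reject the request.
  synchronized boolean newEpoch(long proposedEpoch) {
    if (proposedEpoch <= lastPromisedEpoch) {
      return false;
    }
    lastPromisedEpoch = proposedEpoch;
    return true;
  }
}

public class NewEpochRace {
  // Send newEpoch(proposed) to every JN and count the success responses.
  static int sendNewEpoch(List<JournalNodeSketch> jns, long proposed) {
    int acks = 0;
    for (JournalNodeSketch jn : jns) {
      if (jn.newEpoch(proposed)) acks++;
    }
    return acks;
  }

  public static void main(String[] args) {
    List<JournalNodeSketch> jns = new ArrayList<>();
    for (int i = 0; i < 3; i++) jns.add(new JournalNodeSketch(1));

    // Both NNs concurrently observed lastPromisedEpoch == 1 in step 1,
    // so both propose epoch 2 in step 2.
    long proposed = 2;

    int nn1Acks = sendNewEpoch(jns, proposed); // NN1's messages arrive first
    int nn2Acks = sendNewEpoch(jns, proposed); // every JN has now promised 2

    int quorum = jns.size() / 2 + 1;
    System.out.println("NN1: acks=" + nn1Acks + " wins=" + (nn1Acks >= quorum));
    System.out.println("NN2: acks=" + nn2Acks + " wins=" + (nn2Acks >= quorum));
  }
}
```

With interleaved (rather than back-to-back) delivery, NN1 and NN2 can each collect a minority of acks and both abort, which is the message-loss/race case noted above that the external leader election is expected to resolve.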