[ 
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449101#comment-13449101
 ] 

Todd Lipcon commented on HDFS-3077:
-----------------------------------

bq. Hey Todd, I have not looked at the work in this branch in a while. One 
thing I wanted to ask you about is, why are we using journal daemons to decide 
on an epoch? Could zookeeper be used for doing the same? What are the 
advantages of using journal daemons instead of zk? Adding this information to 
the document might also be useful.

Certainly you could use ZK to generate an increasing sequence ID to decide on 
an epoch. But, given we already have the journal daemons, it's trivial to 
generate unique increasing sequence IDs without using an external dependency. 
The protocol is very simple:
- ask each of the JNs for their highest epoch seen
- set local epoch to one higher than the highest seen from any JN
- ask JNs to promise you that epoch

If you succeed on a quorum, then no one else can successfully achieve a quorum 
on the same epoch number. If you don't succeed, that means you raced with some 
other writer. At that point you could either retry or just fail.

There is a stress test that verifies that this protocol works correctly - 
please see TestEpochsAreUnique.

As for the advantages of not depending on ZooKeeper, my experience working with 
ZK in the context of the HBase Master has convinced me that it's not a panacea 
for situations like this. One of the biggest issues we've had in the HBase 
Master design is loss of synchronization between what is the truth in ZooKeeper 
vs what the individual participants think is the truth. ZooKeeper's consistency 
semantics are that different clients, when connected to different nodes in the 
quorum, may be arbitrarily "behind" in their view of the data. This means that, 
even if we update an epoch number in ZooKeeper, for example, one of the JNs may 
not receive the update for some number of seconds, and can continue to accept 
writes from previous writers. So, we still have to deal with fencing and all of 
these quorum protocols on our own, and I don't think ZK provides much for us.

The other advantage of building this as a self-contained system is that it's 
easier for us to test and debug. For example, the randomized test cases have 
been set up so that the entire system runs single-threaded and, given a random 
seed, can reproduce a given set of dropped messages. This would be very hard to 
implement on top of ZooKeeper where all of the messaging is opaque to our 
purposes.

The third thing I'll mention is what I informally call the "two things" 
problem: when you have some data in ZK, and some data on the JNs, it's possible 
that the two could get out of sync. For example, if an administrator 
accidentally reformats ZooKeeper, our fencing guarantees will become screwed 
up. So, we have to guard against this, add code to re-format ZK safely, etc. 
Another example situation is to consider what happens when a NN is partitioned 
from the majority of the ZooKeeper nodes but not partitioned from a majority of 
the JournalNodes. Should it stop writing? If the other NN can reach a quorum of 
ZK but not a quorum of JN, should it begin writing? Or should the whole system 
stop in its tracks? If the whole system stops, then we have introduced an 
availability dependency on ZooKeeper such that no edits may be made while ZK is 
down. This is worse off than we are today: we can continue operating while ZK 
is down (though we can't process a new failover).

So, to summarize, while I think ZK can reduce complexity for a lot of 
applications, in this case I prefer the control from "doing it ourselves". We 
already have to build all of the quorum counting infrastructure, etc, and don't 
see what there is to gain from the extra dependency. Hope all of the above 
makes sense!
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: QuorumJournalManager (HDFS-3077)
>
>         Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, hdfs-3077.txt, 
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, 
> qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to