[ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13227780#comment-13227780 ]
Todd Lipcon commented on HDFS-3077: ----------------------------------- I plan to post a more thorough design doc in the next week or two, but my current rough thinking on the design is: - Implement a standalone daemon which hosts the JournalProtocol interface, or something similar to it. This daemon is responsible for accepting edits from the active master and writing them locally. We will need to extend the interface to also allow reading from the logs, and a recovery operation (see below) - The user would configure an odd number of these logger daemons (could be collocated with the NNs or even with DNs, given a dedicated disk) - Create a JournalManager implementation which uses a quorum protocol to commit edits to the logger daemons. As for the particular quorum protocol, I am leaning towards implementing ZAB -- its properties align very closely to the semantics we need. I see the following advantages over the BookKeeper approach: - Re-uses existing Hadoop subsystems like IPC, security, and the file-based edit logging code. This means that it will be easier to maintain for the Hadoop development community, and easier to deploy for Hadoop operations. - Doesn't introduce a new dependency on an external project. If there is a bug discovered in this code, we can fix it with a new Hadoop release without having to wait on a new release of ZooKeeper. Since ZK and HDFS may be managed by different ops teams, this also simplifies upgrade. - BookKeeper is a general system, whereas this is a specific system. Since BK tries to be quite general, it has extra complexity that we don't need. For example, it handles the interleaving of up to thousands of distinct edit logs into a single on-disk layout. These complexities are useful for a general "write-ahead log as a service" project, but not for our use case where even very large clusters have only a handful of distinct logs. - BookKeeper's commit protocol waits for all replicas to commit. This means that, should one of the bookies fail, one must wait for a rather lengthy timeout before continuing. Additionally, the latency of a commit is the maximum of the latency of the bookies, meaning that it's much less feasible to collocate bookies with other machines under load like DataNodes. A quorum commit protocol instead has a latency equal to the median of its replicas' latencies, allowing it to ride over transient slowness on the part of one of its replicas. The BK approach is still interesting for some deployments, and I'd advocate that we explore both options in parallel. If one approach as implemented turns out to have significant advantages, we can at that time choose to support only one, but in the meantime, I think it's worth developing both approaches. > Quorum-based protocol for reading and writing edit logs > ------------------------------------------------------- > > Key: HDFS-3077 > URL: https://issues.apache.org/jira/browse/HDFS-3077 > Project: Hadoop HDFS > Issue Type: New Feature > Components: ha, name-node > Reporter: Todd Lipcon > Assignee: Todd Lipcon > > Currently, one of the weak points of the HA design is that it relies on > shared storage such as an NFS filer for the shared edit log. One alternative > that has been proposed is to depend on BookKeeper, a ZooKeeper subproject > which provides a highly available replicated edit log on commodity hardware. > This JIRA is to implement another alternative, based on a quorum commit > protocol, integrated more tightly in HDFS and with the requirements driven > only by HDFS's needs rather than more generic use cases. More details to > follow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira