[jira] [Commented] (HDFS-3077) Quorum-based protocol for reading and writing edit logs

Ivan Kelly (Commented) (JIRA) Wed, 14 Mar 2012 11:57:03 -0700

    [ https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229500#comment-13229500 ]


Ivan Kelly commented on HDFS-3077:
----------------------------------


{quote}
 *  Re-uses existing Hadoop subsystems like IPC, security, and the file-based 
edit logging code. This means that it will be easier to maintain for the Hadoop 
development community, and easier to deploy for Hadoop operations.
 *  Doesn't introduce a new dependency on an external project. If there is a 
bug discovered in this code, we can fix it with a new Hadoop release without 
having to wait on a new release of ZooKeeper. Since ZK and HDFS may be managed 
by different ops teams, this also simplifies upgrade.
{quote}
These arguments seem very much to be a case of NIH.

{quote}
 *  BookKeeper is a general system, whereas this is a specific system. Since BK 
tries to be quite general, it has extra complexity that we don't need. For 
example, it handles the interleaving of up to thousands of distinct edit logs 
into a single on-disk layout. These complexities are useful for a general 
"write-ahead log as a service" project, but not for our use case where even 
very large clusters have only a handful of distinct logs.
{quote}
So the plan is to step around this complexity by implementing ZAB?

{quote}
 *  BookKeeper's commit protocol waits for all replicas to commit. This means 
that, should one of the bookies fail, one must wait for a rather lengthy 
timeout before continuing. Additionally, the latency of a commit is the maximum 
of the latency of the bookies, meaning that it's much less feasible to 
collocate bookies with other machines under load like DataNodes. A quorum 
commit protocol instead has a latency equal to the median of its replicas' 
latencies, allowing it to ride over transient slowness on the part of one of 
its replicas.
{quote}
It would actually be very simple to change this within BookKeeper if needed. 
Instead of sending to a quorum, you could send to the whole ensemble and wait 
for responses from a quorum. None of BookKeeper's guarantees would be broken, 
though throughput would obviously drop. Currently, with BookKeeper, we're able 
to get higher throughput than when using a filer or a local file[1].
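
To make the latency point concrete, here is a rough sketch comparing a commit that waits for every replica with one that sends to the whole ensemble and returns after a quorum of acks. The function names and the latency figures are invented for illustration; this is not BookKeeper code.

```python
def all_replica_latency(latencies):
    """Commit completes only when every replica has acked."""
    return max(latencies)


def quorum_latency(latencies, quorum):
    """Commit completes once the `quorum` fastest replicas have acked."""
    if not 1 <= quorum <= len(latencies):
        raise ValueError("quorum must be between 1 and the ensemble size")
    return sorted(latencies)[quorum - 1]


# Hypothetical per-replica ack latencies in milliseconds; one replica is slow.
acks_ms = [4, 5, 250]
print(all_replica_latency(acks_ms))  # 250
print(quorum_latency(acks_ms, 2))    # 5
```

With a majority quorum over an odd-sized ensemble, the quorum latency is the median of the replica latencies, which is why a single slow replica does not hold up the commit.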

Also, I don't think ZAB is the right tool for this in any case. You have a 
single writer, which can therefore act as a sequencer on the entries. You just 
need to broadcast to an ensemble, and wait for quorum responses, as I outlined 
above for BookKeeper.
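
A minimal sketch of that idea, assuming an in-memory ensemble: the single writer assigns sequence numbers itself, sends each entry to every replica, and treats the entry as committed once a quorum has acked. The class and method names (`Replica`, `SingleWriter`, `append`, `write`) are hypothetical, the sends are done one after another rather than in parallel, and recovery after a failed write is not modelled.

```python
class Replica:
    """Hypothetical log replica that stores entries by sequence number."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = {}

    def append(self, seq, entry):
        # A replica that is down never acks.
        if not self.up:
            return False
        self.log[seq] = entry
        return True


class SingleWriter:
    """Single writer acting as the sequencer for its own entries."""

    def __init__(self, ensemble, quorum):
        self.ensemble = ensemble
        self.quorum = quorum
        self.next_seq = 0

    def write(self, entry):
        seq = self.next_seq
        self.next_seq += 1
        # Broadcast to the whole ensemble and count the acks.
        acks = sum(1 for replica in self.ensemble if replica.append(seq, entry))
        if acks < self.quorum:
            raise IOError(f"entry {seq}: {acks} acks, need {self.quorum}")
        return seq


ensemble = [Replica("a"), Replica("b"), Replica("c", up=False)]
writer = SingleWriter(ensemble, quorum=2)
print(writer.write("mkdir /foo"))  # 0
```

Because only one writer exists, the order of entries is fixed by the writer's counter, so the replicas do not need to agree on ordering among themselves. A real writer would send in parallel and return as soon as the quorum-th ack arrives.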

[1] http://people.apache.org/~ivank/tpt_mar14.pdf
                
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
>                 Key: HDFS-3077
>                 URL: https://issues.apache.org/jira/browse/HDFS-3077
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: ha, name-node
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Currently, one of the weak points of the HA design is that it relies on 
> shared storage such as an NFS filer for the shared edit log. One alternative 
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject 
> which provides a highly available replicated edit log on commodity hardware. 
> This JIRA is to implement another alternative, based on a quorum commit 
> protocol, integrated more tightly in HDFS and with the requirements driven 
> only by HDFS's needs rather than more generic use cases. More details to 
> follow.

