Re: why not zookeeper for the namenode

Todd Lipcon Fri, 19 Feb 2010 08:00:00 -0800

On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch <tho...@koch.ro> wrote:
> Hi,
>
> yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
> From what I read, I thought that bookkeeper would be the ideal enhancement
> for the namenode, to make it distributed and therefore finally highly available.
> Now I searched whether work in that direction has already started and found out
> that apparently a totally different approach has been chosen:
> http://issues.apache.org/jira/browse/HADOOP-4539
>
> Since I'm new to hadoop, I do trust your decision. However, I'd be glad if
> somebody could satisfy my curiosity:
>


I didn't work on that particular design, but I'll do my best to answer
your questions below:

> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
>  seems to do a similar job already in hbase.
>

HBase does not use Bookkeeper, currently. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's
still in development, as I understand it.

> - Isn't it the case that with HADOOP-4539 clients can only connect to one namenode at
>  a time, leaving the burden of all reads and writes on that one's shoulders?
>

Yes.

> - Isn't it the case that zookeeper would be more network efficient? It requires only a
>  majority of nodes to receive a change, while HADOOP-4539 seems to require
>  all backup nodes to receive a change before it's persisted.
>

Potentially. However, "all backup nodes" is usually just 1. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys don't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.
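
To put rough numbers on the 2-versus-3 replica point, here is a toy
model, not a measurement. It assumes a failure is either a correlated
event that takes out every replica at once (e.g. the whole DC losing
power) or each replica failing independently. The function name and both
probabilities are made up purely for illustration:

```python
def p_all_down(n_replicas, p_independent, p_correlated):
    """Probability that every NN replica is unavailable at once.

    Toy model (an assumption, not real data): with probability
    p_correlated a shared event takes out all replicas; otherwise
    each replica fails independently with probability p_independent.
    """
    return p_correlated + (1 - p_correlated) * p_independent ** n_replicas


# Hypothetical numbers, chosen only to illustrate the shape of the argument.
P_IND = 0.01    # chance a single replica is down on its own
P_CORR = 0.001  # chance of a whole-DC event

two = p_all_down(2, P_IND, P_CORR)
three = p_all_down(3, P_IND, P_CORR)

print(f"2 replicas: {two:.6f}")
print(f"3 replicas: {three:.6f}")
print(f"gain from the third replica: {two - three:.6f}")
```

Under these assumed numbers the correlated term sets a floor that no
number of replicas in the same DC can get below, so the third replica
only shaves off a small fraction of the remaining risk.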

-Todd

> Thanks for any explanation,
>
> Thomas Koch, http://www.koch.ro
>
