On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch <tho...@koch.ro> wrote:

> Hi,
>
> yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
> From what I read, I thought that bookkeeper would be the ideal enhancement
> for the namenode, to make it distributed and therefore finally highly available.
> Now I searched whether work in that direction has already started and found out
> that apparently a totally different approach has been chosen:
> http://issues.apache.org/jira/browse/HADOOP-4539
>
> Since I'm new to hadoop, I do trust in your decision. However I'd be glad if
> somebody could satisfy my curiosity:
>
I didn't work on that particular design, but I'll do my best to answer your
questions below:

> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
> seems to do a similar job already in hbase.
>

HBase does not currently use Bookkeeper. Rather, it just uses ZK for
election and a small amount of metadata tracking. It therefore stores only
a small amount of data in ZK, whereas the Hadoop NN would have to store many
GB worth of namesystem data. I don't think anyone has tried putting such a
large amount of data in ZK yet, and being the first to do something is never
without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's still
in development, as I understand it.

> - Isn't it the case that with HADOOP-4539 clients can only connect to one
> namenode at a time, leaving the burden of all reads and writes on that
> one's shoulders?
>

Yes.

> - Wouldn't zookeeper be more network efficient? It requires only a
> majority of nodes to receive a change, while HADOOP-4539 seems to require
> all backup nodes to receive a change before it's persisted.
>

Potentially. However, "all backup nodes" is usually just 1. In our
experience, and the experience of most other Hadoop deployments I've spoken
with, the primary factors decreasing NN availability are *not* system
crashes, but rather lack of online upgrade capability, slow restart time for
planned restarts, etc. Adding a hot standby can help with the planned upgrade
situation, but two standbys don't give you much reliability above one. In a
datacenter, the failure correlations are generally such that racks either
fail independently, or the entire DC has lost power. So, there aren't a lot
of cases where 3 NN replicas would buy you much over 2.

-Todd

> Thanks for any explanation,
>
> Thomas Koch, http://www.koch.ro
>
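
To make the "majority versus all backup nodes" comparison from the thread concrete, here is a minimal sketch of how many nodes have to persist a change under each scheme. It is only an illustration: the function names are hypothetical and nothing here is taken from ZooKeeper, BookKeeper, or the HADOOP-4539 patch. It assumes the primary counts as one of the nodes that persists the change.

```python
# Illustrative only: these helpers are hypothetical and are not taken from
# ZooKeeper, BookKeeper, or the HADOOP-4539 patch.


def majority_quorum_size(ensemble_size: int) -> int:
    """Nodes that must persist a change under a majority-quorum scheme."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be >= 1")
    return ensemble_size // 2 + 1


def all_backups_size(num_backups: int) -> int:
    """Nodes that must persist a change when the primary waits for every backup."""
    if num_backups < 0:
        raise ValueError("num_backups must be >= 0")
    return 1 + num_backups  # the primary itself plus each backup


if __name__ == "__main__":
    # The two cases from the discussion: a 3-node ensemble versus
    # one primary with a single backup node.
    print("3-node majority quorum waits for", majority_quorum_size(3), "nodes")
    print("primary + 1 backup waits for", all_backups_size(1), "nodes")
```

With a single backup, both schemes wait for the same number of nodes, which is the point of the "usually just 1" remark above. The gap only opens up as more backups are added.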