Re: why not zookeeper for the namenode
> From what I read, I thought that bookkeeper would be the ideal
> enhancement for the namenode, to make it distributed and therefore
> finally highly available.

Being distributed doesn't imply high availability. Availability is about minimizing downtime. For example, a primary that can fail over to a secondary (and back) may be more available than a distributed system that needs to be restarted when its software or dependencies are upgraded. A distributed system that can only handle x ops/second may be less available than a non-distributed system that can handle 2x ops/second. A large distributed system may be less available than n smaller systems, depending on consistency requirements. Implementation and management complexity often result in more downtime. Etc. Decreasing the time it takes to restart and upgrade an HDFS cluster would significantly improve its availability for many users (there are jiras for these).

> - Why hasn't zookeeper(-bookkeeper) been chosen?

Persisting NN metadata over a set of servers is only part of the problem. You might be interested in checking out HDFS-976.

Thanks,
Eli
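Eli's point that failover speed can matter more than distribution can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the standard availability formula; every number in it is invented purely for illustration, not a measurement of any real cluster:

```python
# Back-of-the-envelope availability comparison (illustrative numbers only).
# Availability = MTBF / (MTBF + MTTR): the uptime fraction per failure cycle.

def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A simple primary/secondary pair: failures are rare and failover is fast.
primary_failover = availability(mtbf_hours=2000, mttr_hours=0.1)

# A distributed service that rarely crashes but needs a full restart
# (say, 2 hours of downtime) for every software or dependency upgrade.
distributed = availability(mtbf_hours=720, mttr_hours=2)

print(f"primary/failover: {primary_failover:.5f}")
print(f"distributed:      {distributed:.5f}")
assert primary_failover > distributed
```

With these (made-up) inputs the non-distributed pair wins, which is exactly the argument: what dominates is how long each outage lasts, not how many machines share the state.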
Re: why not zookeeper for the namenode
E. Sammer wrote:
> On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:
>> My 2 cents: If the NN stores all state behind a javax.Cache façade it
>> will be possible to use all kinds of products (commercial, open
>> source, facades to ZK) for redundancy, load balancing, etc.
>
> This would be pretty interesting from a deployment / scale point of
> view in that many jcache providers support flushing to disk and the
> concept of distribution and near / far data. Now, all of that said,
> this also removes some certainty from the name node contract. If
> jcache were used we:
> - couldn't promise data would be in memory (both a plus and a minus).
> - couldn't promise data is on the same machine.
> - can't make guarantees about consistency (flushes, PITR features, etc.).
> It may be too general an abstraction layer for this type of
> application. It's a good avenue to explore, just playing devil's
> advocate.

I know nothing about HA, though I work with people who do. They normally start their conversations with "Steve, you idiot, you don't understand", because HA is all or nothing: either you are HA or you aren't. It's also somewhere you need to reach for the mathematics to prove it works in theory, then test in the field in interesting situations. Even then, they have horror stories.

We all know that NNs have limits, but most of those limits are known and documented:
- doesn't like full disks
- a corrupted editlog doesn't replay
- if the 2ary NN isn't live, the NN will stay up (it's been discussed having it somehow react if there isn't any secondary around and you say you require one)

There's also an open issue, possibly related to UTF-8 translation, that is confusing restarts, under discussion in -hdfs right now.

What is best is this: everyone has the same code base, one that is tested at scale. If you switch to multiple back ends, then nobody other than the people who have the same back end as you will be able to replicate the problem. You lose the "Yahoo! and Facebook run on petabyte-sized clusters, so my 40 TB is noise" argument, replacing it with "Yahoo! and Facebook run on one cache back end, I use something else and am on my own when it fails". I don't want to go there.
Re: why not zookeeper for the namenode
Eli Collins wrote:
>> From what I read, I thought that bookkeeper would be the ideal
>> enhancement for the namenode, to make it distributed and therefore
>> finally highly available.
>
> Being distributed doesn't imply high availability. Availability is
> about minimizing downtime. For example, a primary that can fail over
> to a secondary (and back) may be more available than a distributed
> system that needs to be restarted when its software or dependencies
> are upgraded. A distributed system that can only handle x ops/second
> may be less available than a non-distributed system that can handle
> 2x ops/second. A large distributed system may be less available than
> n smaller systems, depending on consistency requirements.
> Implementation and management complexity often result in more
> downtime. Etc. Decreasing the time it takes to restart and upgrade an
> HDFS cluster would significantly improve its availability for many
> users (there are jiras for these).

There's another availability: engineering availability. What we have today is nice in that the HDFS engineering skills are distributed among a number of companies, and the source is there for anyone else to learn; rebuilding is fairly straightforward. Don't dismiss that as unimportant. Engineering availability means that if you discover a new problem, you have the ability to patch your copy of the code and keep going, while filing a bug report for others to deal with. That significantly reduces your downtime on a software problem compared to filing a bug report and hoping that a future release will have addressed it.

-steve
Re: why not zookeeper for the namenode
You can see the work Flavio and Luca did integrating the NN with BookKeeper: http://bit.ly/aELMbH

This addresses some issues but not all of them (and I'm not enough of an expert on the NN to explain them all). I believe it would not address availability in the sense that BK alone does not mean that a secondary could come on-line (near) instantaneously in the case of primary failure. BK/ZK would also introduce two new components that need to be managed as part of a hadoop cluster - that's another issue that I've heard wrt adding this. Hadoop is not the only project to voice this concern; see this re Cassandra and ZK: http://bit.ly/bW51rd

As Todd mentioned, BK is still pretty new. We are working on another significant project that is using it (working on open sourcing it), but afaik there is no major project using it at this time. So that's definitely a factor as well.

Patrick

> On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote:
>> Hi,
>> yesterday I read the documentation of zookeeper and the zk contrib
>> bookkeeper. From what I read, I thought that bookkeeper would be the
>> ideal enhancement for the namenode, to make it distributed and
>> therefore finally highly available. Now I searched whether work in
>> that direction has already started and found out that apparently a
>> totally different approach has been chosen:
>> http://issues.apache.org/jira/browse/HADOOP-4539
>> Since I'm new to hadoop, I do trust in your decision. However I'd be
>> glad if somebody could satisfy my curiosity:
>
> I didn't work on that particular design, but I'll do my best to answer
> your questions below:
>
>> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
>> seems to do a similar job already in hbase.
>
> HBase does not use Bookkeeper, currently. Rather, it just uses ZK for
> election and some small amount of metadata tracking. It therefore is
> only storing a small amount of data in ZK, whereas the Hadoop NN would
> have to store many GB worth of namesystem data. I don't think anyone
> has tried putting such a large amount of data in ZK yet, and being the
> first to do something is never without problems :) Additionally, when
> this design was made, Bookkeeper was very new. It's still in
> development, as I understand it.
>
>> - Isn't it that with HADOOP-4539 clients can only connect to one
>> namenode at a time, leaving the burden of all reads and writes on
>> that one's shoulders?
>
> Yes.
>
>> - Isn't it that zookeeper would be more network efficient? It
>> requires only a majority of nodes to receive a change, while
>> HADOOP-4539 seems to require all backup nodes to receive a change
>> before it's persisted.
>
> Potentially. However, "all backup nodes" is usually just 1. In our
> experience, and the experience of most other Hadoop deployments I've
> spoken with, the primary factors decreasing NN availability are *not*
> system crashes, but rather lack of online upgrade capability, slow
> restart time for planned restarts, etc. Adding a hot standby can help
> with the planned upgrade situation, but two standbys don't give you
> much reliability above one. In a datacenter, the failure correlations
> are generally such that racks either fail independently, or the
> entire DC has lost power. So, there aren't a lot of cases where 3 NN
> replicas would buy you much over 2.
>
> -Todd
RE: why not zookeeper for the namenode
My 2 cents: If the NN stores all state behind a javax.Cache façade it will be possible to use all kinds of products (commercial, open source, facades to ZK) for redundancy, load balancing, etc.

Zlatin

-----Original Message-----
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Monday, February 22, 2010 12:47 PM
To: t...@cloudera.com; tho...@koch.ro; common-user@hadoop.apache.org
Subject: Re: why not zookeeper for the namenode
Re: why not zookeeper for the namenode
On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:
> My 2 cents: If the NN stores all state behind a javax.Cache façade it
> will be possible to use all kinds of products (commercial, open
> source, facades to ZK) for redundancy, load balancing, etc.

This would be pretty interesting from a deployment / scale point of view in that many jcache providers support flushing to disk and the concept of distribution and near / far data. Now, all of that said, this also removes some certainty from the name node contract. If jcache were used we:
- couldn't promise data would be in memory (both a plus and a minus).
- couldn't promise data is on the same machine.
- can't make guarantees about consistency (flushes, PITR features, etc.).

It may be too general an abstraction layer for this type of application. It's a good avenue to explore, just playing devil's advocate.

Regards.

-- 
Eric Sammer
e...@lifeless.net
http://esammer.blogspot.com
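The façade idea under discussion can be sketched in a few lines. javax.cache itself is a Java API; the Python below only illustrates the shape of the proposal, and every class and method name in it is invented for this sketch, not taken from Hadoop or any JSR-107 provider:

```python
# Rough sketch of the pluggable-cache idea being debated above.
# javax.cache is a Java API; these names are invented for illustration.

from abc import ABC, abstractmethod

class CacheBackend(ABC):
    """The facade: the namenode codes against this, not a concrete store."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryBackend(CacheBackend):
    """One possible backend: fast, but lives on one machine and dies
    with the process - exactly the guarantee Eric notes we'd lose the
    moment some other backend is plugged in instead."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class NamespaceStore:
    """NN-style metadata stored behind whatever backend is configured."""
    def __init__(self, backend: CacheBackend):
        self._backend = backend
    def record_block(self, path, block_ids):
        self._backend.put(path, block_ids)
    def lookup(self, path):
        return self._backend.get(path)

store = NamespaceStore(InMemoryBackend())
store.record_block("/user/data/part-0000", ["blk_1", "blk_2"])
assert store.lookup("/user/data/part-0000") == ["blk_1", "blk_2"]
```

The flexibility and the objection are two sides of the same interface: once `NamespaceStore` only sees `CacheBackend`, it can no longer promise that a lookup hits local memory, the same machine, or a consistently flushed copy.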
why not zookeeper for the namenode
Hi,
yesterday I read the documentation of zookeeper and the zk contrib bookkeeper. From what I read, I thought that bookkeeper would be the ideal enhancement for the namenode, to make it distributed and therefore finally highly available. Now I searched whether work in that direction has already started and found out that apparently a totally different approach has been chosen: http://issues.apache.org/jira/browse/HADOOP-4539

Since I'm new to hadoop, I do trust in your decision. However I'd be glad if somebody could satisfy my curiosity:

- Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it seems to do a similar job already in hbase.
- Isn't it that with HADOOP-4539 clients can only connect to one namenode at a time, leaving the burden of all reads and writes on that one's shoulders?
- Isn't it that zookeeper would be more network efficient? It requires only a majority of nodes to receive a change, while HADOOP-4539 seems to require all backup nodes to receive a change before it's persisted.

Thanks for any explanation,

Thomas Koch, http://www.koch.ro
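The last question turns on how many acknowledgements a write needs before it counts as persisted. A toy model of the two policies (this is not ZooKeeper's or HADOOP-4539's actual protocol, just the arithmetic behind the question):

```python
# Toy model of write-acknowledgement policies (not the real ZK/NN protocol).

def acks_required(n_replicas, policy):
    """How many replica acks a write needs before it counts as durable."""
    if policy == "majority":   # ZooKeeper-style quorum
        return n_replicas // 2 + 1
    elif policy == "all":      # every backup must receive the change
        return n_replicas
    raise ValueError(policy)

def write_commits(n_replicas, n_responsive, policy):
    """Does a write commit when only n_responsive replicas answer?"""
    return n_responsive >= acks_required(n_replicas, policy)

# With 3 replicas, one of which is slow or down:
assert write_commits(3, 2, "majority")   # quorum write still commits
assert not write_commits(3, 2, "all")    # all-ack write stalls
```

A majority quorum keeps committing through a minority of slow or failed replicas, at the cost of a minority possibly lagging; requiring every backup makes each write only as fast as the slowest node.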
Re: why not zookeeper for the namenode
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote:
> Hi,
> yesterday I read the documentation of zookeeper and the zk contrib
> bookkeeper. From what I read, I thought that bookkeeper would be the
> ideal enhancement for the namenode, to make it distributed and
> therefore finally highly available. Now I searched whether work in
> that direction has already started and found out that apparently a
> totally different approach has been chosen:
> http://issues.apache.org/jira/browse/HADOOP-4539
> Since I'm new to hadoop, I do trust in your decision. However I'd be
> glad if somebody could satisfy my curiosity:

I didn't work on that particular design, but I'll do my best to answer your questions below:

> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
> seems to do a similar job already in hbase.

HBase does not use Bookkeeper, currently. Rather, it just uses ZK for election and some small amount of metadata tracking. It therefore is only storing a small amount of data in ZK, whereas the Hadoop NN would have to store many GB worth of namesystem data. I don't think anyone has tried putting such a large amount of data in ZK yet, and being the first to do something is never without problems :) Additionally, when this design was made, Bookkeeper was very new. It's still in development, as I understand it.

> - Isn't it that with HADOOP-4539 clients can only connect to one
> namenode at a time, leaving the burden of all reads and writes on
> that one's shoulders?

Yes.

> - Isn't it that zookeeper would be more network efficient? It
> requires only a majority of nodes to receive a change, while
> HADOOP-4539 seems to require all backup nodes to receive a change
> before it's persisted.

Potentially. However, "all backup nodes" is usually just 1. In our experience, and the experience of most other Hadoop deployments I've spoken with, the primary factors decreasing NN availability are *not* system crashes, but rather lack of online upgrade capability, slow restart time for planned restarts, etc. Adding a hot standby can help with the planned upgrade situation, but two standbys don't give you much reliability above one. In a datacenter, the failure correlations are generally such that racks either fail independently, or the entire DC has lost power. So, there aren't a lot of cases where 3 NN replicas would buy you much over 2.

-Todd

> Thanks for any explanation,
> Thomas Koch, http://www.koch.ro
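Todd's closing point, that a third replica buys little when failures are correlated, can be illustrated with a toy probability model. All probabilities below are invented for illustration only:

```python
# Toy model of correlated vs. independent failures (illustrative numbers).
# Replicas sit on separate racks; racks fail independently, but a DC-wide
# outage takes out every replica at once.

p_rack = 0.001   # chance a given rack is down (independent across racks)
p_dc   = 0.0005  # chance the whole datacenter is down (correlated)

def p_all_replicas_down(n_replicas):
    independent_loss = p_rack ** n_replicas
    return p_dc + (1 - p_dc) * independent_loss

two = p_all_replicas_down(2)
three = p_all_replicas_down(3)
print(f"2 replicas: {two:.7f}")
print(f"3 replicas: {three:.7f}")
# The correlated term p_dc dominates, so the third replica barely helps:
assert (two - three) / two < 0.01
```

With independent failures a third replica would cut the loss probability by three orders of magnitude; with a DC-wide failure mode in the picture, the shared term swamps that gain, which is the argument for not going past 2.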