Re: why not zookeeper for the namenode

2010-02-23 Thread Eli Collins
 From what I read, I thought that bookkeeper would be the ideal enhancement
 for the namenode, to make it distributed and therefore finally highly available.

Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when its software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve its availability for many users (there
are jiras for these).
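Eli's point that restart and upgrade time, not crashes, often dominates availability can be made concrete with some back-of-envelope arithmetic. The numbers below are purely hypothetical, chosen only to illustrate the shape of the comparison:

```python
# Back-of-envelope availability arithmetic (illustrative numbers only):
# compare a cluster that needs a full restart for every planned upgrade
# against one that can fail over to a warm standby during maintenance.

MINUTES_PER_YEAR = 365 * 24 * 60

def availability(downtime_minutes_per_year):
    """Fraction of the year the service is up."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR

# Hypothetical: 12 planned upgrades a year; 30 minutes each for a full
# restart, versus 2 minutes of failover with a warm standby.
restart_downtime = 12 * 30    # 360 minutes/year without failover
failover_downtime = 12 * 2    # 24 minutes/year with a warm standby

print(f"full restart: {availability(restart_downtime):.5f}")
print(f"failover:     {availability(failover_downtime):.5f}")
```

Under these assumed numbers, planned-restart downtime costs more than an order of magnitude more availability than failover does, without a single unplanned crash in the picture.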

 - Why hasn't zookeeper(-bookkeeper) been chosen?

Persisting NN metadata over a set of servers is only part of the
problem. You might be interested in checking out HDFS-976.

Thanks,
Eli


Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

E. Sammer wrote:

On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:
My 2 cents: If the NN stores all state behind a javax.Cache façade it 
will be possible to use all kinds of products (commercial, open 
source, facades to ZK) for redundancy, load balancing, etc.


This would be pretty interesting from a deployment / scale point of view 
in that many jcache providers support flushing to disk and the concept 
of distribution and near / far data. Now, all of that said, this also 
removes some certainty from the name node contract, namely:


If jcache was used we:
 - couldn't promise data would be in memory (both a plus and a minus).
 - couldn't promise data is on the same machine.
 - can't make guarantees about consistency (flushes, PITR features, etc.).

It may be too general an abstraction layer for this type of application. 
It's a good avenue to explore, just playing devil's advocate.


Regards.


I know nothing about HA, though I work with people who do. They normally 
start their conversations with "Steve, you idiot, you don't understand...", 
because HA is all or nothing: either you are HA or you aren't. 
It's also an area where you need to reach for the mathematics to prove things 
work in theory, then test in the field in interesting situations. Even then, 
they have horror stories.


We all know that NNs have limits, but most of those limits are known 
and documented:

 -doesn't like full disks
 -a corrupted editlog doesn't replay
 -if the 2ary NN isn't live, the NN will stay up (it's been discussed 
having it somehow react if there isn't any secondary around and you say 
you require one)


There's also an open issue that may be related to UTF-8 translation that 
is confusing restarts, under discussion in -hdfs right now.


What is best is this: everyone has the same code base, one that is 
tested at scale.


If you switch to multiple back ends, then nobody other than the people 
who have the same back end as you will be able to replicate the problem. 
You lose the "Yahoo! and Facebook run on petabyte-sized clusters, so my 
40 TB is noise" argument, replacing it with "Yahoo! and Facebook run on 
one cache back end; I use something else and am on my own when it fails."


I don't want to go there.




Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

Eli Collins wrote:

From what I read, I thought that bookkeeper would be the ideal enhancement
for the namenode, to make it distributed and therefore finally highly available.


Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when its software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve its availability for many users (there
are jiras for these).



There's another kind of availability: engineering availability. What we have 
today is nice in that the HDFS engineering skills are distributed among 
a number of companies, the source is there for anyone else to learn from, 
and rebuilding is fairly straightforward.


Don't dismiss that as unimportant. Engineering availability means that 
if you discover a new problem, you have the ability to patch your copy 
of the code, and keep going, while filing a bug report for others to 
deal with. That significantly reduces your downtime on a software 
problem compared to filing a bug report and hoping that a future release 
will have addressed it.


-steve



Re: why not zookeeper for the namenode

2010-02-22 Thread Patrick Hunt
You can see the work Flavio and Luca did here integrating the NN with 
BookKeeper: http://bit.ly/aELMbH This addresses some issues but not all 
of them (and I'm not enough of an NN expert to be able to explain them 
all). I believe it would not address availability in the sense that BK 
alone does not mean that a secondary could come on-line (near) 
instantaneously in the case of primary failure. BK/ZK would also 
introduce 2 new components that need to be managed as part of a hadoop 
cluster - that's another issue that I've heard wrt adding this. Hadoop 
is not the only project where this concern has been voiced; see this 
discussion re Cassandra and ZK: 
http://bit.ly/bW51rd


As Todd mentioned, BK is still pretty new. We are working on another 
significant project that is using it (which we are working on open 
sourcing), but afaik there is no major project using it at this time. So 
that's definitely a factor as well.


Patrick



On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote:

Hi,

yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
From what I read, I thought that bookkeeper would be the ideal enhancement
for the namenode, to make it distributed and therefore finally highly available.
Now I searched whether work in that direction had already started, and found out
that apparently a totally different approach has been chosen:
http://issues.apache.org/jira/browse/HADOOP-4539

Since I'm new to hadoop, I do trust your decision. However, I'd be glad if
somebody could satisfy my curiosity:



I didn't work on that particular design, but I'll do my best to answer
your questions below:


- Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
 seems to do a similar job already in hbase.



HBase does not use Bookkeeper, currently. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's
still in development, as I understand it.


- Isn't it that with HADOOP-4539 clients can only connect to one namenode at
 a time, leaving the burden of all reads and writes on that one's shoulders?



Yes.


- Isn't it that zookeeper would be more network efficient? It requires only a
 majority of nodes to receive a change, while HADOOP-4539 seems to require
 all backup nodes to receive a change before it's persisted.



Potentially. However, "all backup nodes" is usually just one. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys doesn't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.

-Todd




RE: why not zookeeper for the namenode

2010-02-22 Thread Zlatin.Balevsky
My 2 cents: If the NN stores all state behind a javax.Cache façade it will be 
possible to use all kinds of products (commercial, open source, facades to ZK) 
for redundancy, load balancing, etc.

Zlatin




Re: why not zookeeper for the namenode

2010-02-22 Thread E. Sammer

On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:

My 2 cents: If the NN stores all state behind a javax.Cache façade it will be 
possible to use all kinds of products (commercial, open source, facades to ZK) 
for redundancy, load balancing, etc.


This would be pretty interesting from a deployment / scale point of view 
in that many jcache providers support flushing to disk and the concept 
of distribution and near / far data. Now, all of that said, this also 
removes some certainty from the name node contract, namely:


If jcache was used we:
 - couldn't promise data would be in memory (both a plus and a minus).
 - couldn't promise data is on the same machine.
 - can't make guarantees about consistency (flushes, PITR features, etc.).

It may be too general an abstraction layer for this type of application. 
It's a good avenue to explore, just playing devil's advocate.
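The façade idea can be sketched in miniature. What follows is a hypothetical Python stand-in for the javax.cache (JCache) API being discussed, not the real NN code or the real JCache interface: the point is only that if the namesystem codes against a narrow put/get façade, the backing store becomes pluggable.

```python
# Minimal sketch of the "NN state behind a cache facade" idea. The class
# and method names here are illustrative stand-ins, not javax.cache.

class CacheFacade:
    """The narrow interface the namesystem would code against."""
    def put(self, key, value):
        raise NotImplementedError
    def get(self, key):
        raise NotImplementedError

class InMemoryBackend(CacheFacade):
    """Simplest provider: a local dict, like today's in-memory NN metadata."""
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)

# A replicated or distributed provider (a ZK facade, a commercial cache,
# etc.) would implement the same two methods; the namesystem's code
# wouldn't change.
ns = InMemoryBackend()
ns.put("/user/thomas/file.txt", {"blocks": [1, 2], "replication": 3})
print(ns.get("/user/thomas/file.txt"))
```

The trade-off Eric raises falls directly out of this shape: once the namesystem only sees `put`/`get`, it can no longer promise that data is in memory, local, or consistently flushed, because all of that becomes the provider's business.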


Regards.
--
Eric Sammer
e...@lifeless.net
http://esammer.blogspot.com


why not zookeeper for the namenode

2010-02-19 Thread Thomas Koch
Hi,

yesterday I read the documentation of zookeeper and the zk contrib bookkeeper. 
From what I read, I thought that bookkeeper would be the ideal enhancement 
for the namenode, to make it distributed and therefore finally highly available.
Now I searched whether work in that direction had already started, and found out 
that apparently a totally different approach has been chosen:
http://issues.apache.org/jira/browse/HADOOP-4539

Since I'm new to hadoop, I do trust your decision. However, I'd be glad if 
somebody could satisfy my curiosity:

- Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
  seems to do a similar job already in hbase.

- Isn't it that with HADOOP-4539 clients can only connect to one namenode at 
  a time, leaving the burden of all reads and writes on that one's shoulders?

- Isn't it that zookeeper would be more network efficient? It requires only a
  majority of nodes to receive a change, while HADOOP-4539 seems to require
  all backup nodes to receive a change before it's persisted.

Thanks for any explanation,

Thomas Koch, http://www.koch.ro


Re: why not zookeeper for the namenode

2010-02-19 Thread Todd Lipcon
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch tho...@koch.ro wrote:
 Hi,

 yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
 From what I read, I thought that bookkeeper would be the ideal enhancement
 for the namenode, to make it distributed and therefore finally highly available.
 Now I searched whether work in that direction had already started, and found out
 that apparently a totally different approach has been chosen:
 http://issues.apache.org/jira/browse/HADOOP-4539

 Since I'm new to hadoop, I do trust your decision. However, I'd be glad if
 somebody could satisfy my curiosity:


I didn't work on that particular design, but I'll do my best to answer
your questions below:

 - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
  seems to do a similar job already in hbase.


HBase does not use Bookkeeper, currently. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's
still in development, as I understand it.
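Todd's contrast is worth making concrete: ZK's role in HBase-style setups is election and small-metadata tracking, not bulk storage. The classic ZK leader-election recipe uses ephemeral sequential znodes; below is a plain-Python simulation of that recipe (a real deployment would use a ZooKeeper client, and the function names here are invented for illustration):

```python
# Simulated sketch of ZK-style leader election: each candidate creates a
# sequenced node (ZK: create("/election/n_", EPHEMERAL_SEQUENTIAL)); the
# candidate holding the lowest sequence number is the leader. In real ZK,
# each other candidate watches the node just below its own to avoid a
# thundering herd when the leader dies.

import itertools

_seq = itertools.count()  # stands in for ZK's per-parent sequence counter

def create_candidate_node(candidates, name):
    """Simulates creating an ephemeral sequential znode for `name`."""
    node = (next(_seq), name)
    candidates.append(node)
    return node

def leader(candidates):
    """Lowest sequence number wins the election."""
    return min(candidates)[1]

candidates = []
for name in ["nn-a", "nn-b", "nn-c"]:
    create_candidate_node(candidates, name)
print(leader(candidates))   # nn-a, since it registered first

# An ephemeral node vanishes when its session dies; the next lowest
# sequence number takes over automatically.
candidates.remove(min(candidates))
print(leader(candidates))   # nn-b
```

The data involved is a handful of tiny znodes, which is exactly why this works well in ZK and why storing many GB of namesystem data there would be uncharted territory.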

 - Isn't it that with HADOOP-4539 clients can only connect to one namenode at
  a time, leaving the burden of all reads and writes on that one's shoulders?


Yes.

 - Isn't it that zookeeper would be more network efficient? It requires only a
  majority of nodes to receive a change, while HADOOP-4539 seems to require
  all backup nodes to receive a change before it's persisted.


Potentially. However, "all backup nodes" is usually just one. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys doesn't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.
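The quorum arithmetic behind this answer is simple enough to write down. The sketch below is illustrative only; it assumes the description above (ZK-style writes need a majority of replicas to ack, while the scheme Thomas describes waits for every backup):

```python
# Illustrative quorum arithmetic: a majority quorum needs floor(n/2) + 1
# acks out of n replicas, while waiting for "all backup nodes" needs n.
# With the usual single backup (n = 2 total), the two schemes coincide.

def majority(n):
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

for n in [1, 2, 3, 5]:
    print(f"n={n}: majority quorum={majority(n)}, all replicas={n}")
```

The savings only appear at n >= 3, which is exactly Todd's point: with the typical one-backup deployment there is nothing for a majority quorum to save, and the failure correlations in a single datacenter rarely justify going to three replicas.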

-Todd

 Thanks for any explanation,

 Thomas Koch, http://www.koch.ro