Re: Managing multi-site clusters with Zookeeper

2010-03-15 Thread Benjamin Reed
it is a bit confusing but initLimit is the timer that is used when a 
follower connects to a leader. there may be some state transfers 
involved to bring the follower up to speed so we need to be able to 
allow a little extra time for the initial connection.


after that we use syncLimit to figure out if a leader or follower is 
dead. a peer (leader or follower) is considered dead if syncLimit ticks 
go by without hearing from the other machine. (this is after the 
initial connection has been made.)
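
for example, in zoo.cfg terms (a sketch using the values michael mentions 
below; illustrative rather than recommended):

  tickTime=2000      # 2 seconds per tick
  initLimit=30       # 30 ticks (60s) for a follower to connect and sync to the leader
  syncLimit=30       # 30 ticks (60s) of silence before a connected peer is presumed dead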


please open a jira to make the text a bit more explicit. feel free to 
add suggestions :)


thanx
ben

On 03/15/2010 04:17 AM, Michael Bauland wrote:

Hi Patrick,

I'm also setting up a Zookeeper ensemble across three different
locations and I've got some questions regarding the parameters as
specified on the page you mentioned:

   

That's controlled by the tickTime/syncLimit/initLimit/etc. See more
about this in the admin guide: http://bit.ly/c726DC
 

- What's the difference between initLimit and syncLimit? For initLimit
it says this is the time to allow followers to connect and sync to a
leader, and syncLimit is the time to allow followers to sync with
ZooKeeper. To me this sounds very similar, since Zookeeper in the
second definition probably means the Zookeeper leader, doesn't it?

- When I connect with a client to the Zookeeper ensemble I provide the
three IP addresses of my three Zookeeper servers. Does the client then
choose one of them arbitrarily or will it always try to connect to the
first one first? I'm asking since I would like to have my clients first
try to connect to the local Zookeeper server and only if that fails (for
whatever reason, maybe it's down) have them try to connect to one of the
servers at a different physical location.


   

You'll want to increase from the defaults since those are typically for
high performance interconnect (ie within colo). You are correct though,
much will depend on your env. and some tuning will be involved.
 

Do you have any suggestions for the parameters? So far I left tickTime
at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute).

Our sites are connected with 1Gbit to the Internet, but of course we
have no influence on what's in between. The data managed by zookeeper is
quite large (snapshots are 700 MByte, but they may increase in the future).

Thanks for your help,

Michael


   




Re: Managing multi-site clusters with Zookeeper

2010-03-15 Thread Patrick Hunt


Michael Bauland wrote:

- When I connect with a client to the Zookeeper ensemble I provide the
three IP addresses of my three Zookeeper servers. Does the client then
choose one of them arbitrarily or will it always try to connect to the
first one first? I'm asking since I would like to have my clients first
try to connect to the local Zookeeper server and only if that fails (for
whatever reason, maybe it's down) have them try to connect to one of the
servers at a different physical location.


It explicitly randomizes the list. This is to prevent all clients from 
connecting to the same servers in the same order; instead it distributes 
the load.


There's an option to turn this off in the C client, but not in the Java 
client. However, this is intended for testing purposes, not for 
production use. We could add it to the Java client (create a JIRA if you 
like), however I'm not sure it would solve your problem. Once the local 
server is accessible again there'd be nothing to cause the client to 
reconnect to the local server. If the issue is intermittent it could be 
non-optimal.


An option is to have more than one local server. So if you distribute btw 3 
colos then use 7 servers (3-2-2) and have the client connect to the 2 
local ones. This handles local failure; however, it does not handle 
partitioning of the local servers from the remote ensemble members. 
That said, if both local servers are unable to connect to remote servers it 
seems likely that the client couldn't either (network partition). Not an 
optimal solution either, unfortunately.


We have talked about adding this functionality to the ZK client (connect 
to the server with lowest latency/load/connections/etc... first). 
However this is not currently implemented. There's also the issue of how 
to decide when to switch to another server (ie when local comes back). 
For the time being you may have to handle this within your own code (two 
possible sessions based on connect string).
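
For example, the two-connect-string idea could look roughly like this with
the Java client. This is only a sketch (host names are hypothetical, timeouts
arbitrary), not a tested recipe:

  // Sketch: try a local-only connect string first, fall back to the full list.
  import java.io.IOException;
  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.TimeUnit;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.Watcher.Event.KeeperState;
  import org.apache.zookeeper.ZooKeeper;

  public class LocalFirstConnect {

      static ZooKeeper connect(String localHosts, String allHosts,
                               int sessionTimeoutMs) throws Exception {
          ZooKeeper zk = tryConnect(localHosts, sessionTimeoutMs);
          if (zk == null) {
              zk = tryConnect(allHosts, sessionTimeoutMs);   // fall back to remote colos
          }
          if (zk == null) {
              throw new IOException("could not reach any ZooKeeper server");
          }
          return zk;
      }

      private static ZooKeeper tryConnect(String hosts, int sessionTimeoutMs)
              throws Exception {
          final CountDownLatch connected = new CountDownLatch(1);
          ZooKeeper zk = new ZooKeeper(hosts, sessionTimeoutMs, new Watcher() {
              public void process(WatchedEvent event) {
                  if (event.getState() == KeeperState.SyncConnected) {
                      connected.countDown();
                  }
              }
          });
          if (!connected.await(sessionTimeoutMs, TimeUnit.MILLISECONDS)) {
              zk.close();        // never got a session in time; give up on this list
              return null;
          }
          return zk;
      }

      public static void main(String[] args) throws Exception {
          // zk-local is in our colo; zk-b and zk-c are in the remote colos.
          ZooKeeper zk = connect("zk-local:2181",
                                 "zk-local:2181,zk-b:2181,zk-c:2181", 30000);
          System.out.println("connected, session 0x"
                  + Long.toHexString(zk.getSessionId()));
          zk.close();
      }
  }

Note this still has the problem described above: nothing here moves the 
client back to the local server once it recovers.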





You'll want to increase from the defaults since those are typically for
high performance interconnect (ie within colo). You are correct though,
much will depend on your env. and some tuning will be involved.


Do you have any suggestions for the parameters? So far I left tickTime
at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute).

Our sites are connected with 1Gbit to the Internet, but of course we
have no influence on what's in between. The data managed by zookeeper is
quite large (snapshots are 700 MByte, but they may increase in the future).


What's your latency look like? Try using something like ping btw your 
colos over a longish period of time. Then look at the min/max/avg 
results that you see. What are you seeing? That, along with a bandwidth 
measurement (copy files using scp say), will help you decide.


We'd definitely be interested in your feedback as part of this process, 
both in terms of docs (feel free to enter jiras) and other insights you 
have as part of the effort.


Regards,

Patrick


Re: Managing multi-site clusters with Zookeeper

2010-03-15 Thread Flavio Junqueira
On top of Ben's description, you probably need to set initLimit to 
several minutes to transfer 700MB (worst case). The value of 
syncLimit, however, does not need to be that large.
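
As a back-of-the-envelope sketch (the effective inter-site transfer rate here 
is an assumption, not a measurement):

  700 MB snapshot at ~2.5 MB/s effective transfer  ->  ~280 s (~5 minutes)
  with tickTime=2000 (2 s per tick) that is ~140 ticks, so something like
  initLimit=200 leaves headroom, while syncLimit can stay much smaller.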


-Flavio


Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Martin Waite
Hi Ted,

If the links do not work for us for zk, then they are unlikely to work with
any other solution - such as trying to stretch Pacemaker or Red Hat Cluster
with their multicast protocols across the links.

If the links are not good enough, we might have to spend some more money to
fix this.

regards,
Martin





Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Patrick Hunt
IMO latency is the primary issue you will face, but also keep in mind 
reliability w/in a colo.


Say you have 3 colos (obv can't be 2). If you only have 3 servers, one 
in each colo, you will be reliable, but clients w/in each colo will have 
to connect to a remote colo if the local one fails. You will want to 
prioritize the local colo given that reads can be serviced entirely 
locally that way. If you have 7 servers (2-2-3) that would be better - if 
a local server fails you have a redundant one, and if both fail then you go remote.
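
For illustration, a 2-2-3 layout would look something like this in each 
server's zoo.cfg (host names hypothetical):

  server.1=zkA1:2888:3888   # colo A
  server.2=zkA2:2888:3888   # colo A
  server.3=zkB1:2888:3888   # colo B
  server.4=zkB2:2888:3888   # colo B
  server.5=zkC1:2888:3888   # colo C
  server.6=zkC2:2888:3888   # colo C
  server.7=zkC3:2888:3888   # colo C

with clients in colo A listing only their two local servers in the connect 
string (e.g. zkA1:2181,zkA2:2181).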


You want to keep your writes as few as possible and as small as 
possible. Why? Say you have 100ms latency btw colos; let's go through a 
scenario for a client in a colo where the local servers are not the 
leader (the zk cluster leader).


read:
1) client reads a znode from local server
2) local server (usually < 1ms if in colo comm) responds in 1ms

write:
1) client writes a znode to local server A
2) A proposes change to the ZK Leader (L) in remote colo
3) L gets the proposal in 100ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100ms
6) followers ack the change
7) L gets the acks in 100ms
8) L commits the change (message to all followers)
9) A gets the commit in 100ms
10) A responds to client (< 1ms)

write latency: 100 + 100 + 100 + 100 = 400ms

Obviously keeping these writes small is also critical.
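
To sanity-check those numbers, a client in the non-leader colo can simply time 
one read and one write - a rough sketch with the Java client (connect string 
and znode path are made up):

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class LatencyProbe {
      public static void main(String[] args) throws Exception {
          ZooKeeper zk = new ZooKeeper("zk-local:2181", 30000, event -> { });
          Thread.sleep(2000);   // crude: give the session time to establish

          String path = "/latency-probe";
          if (zk.exists(path, false) == null) {
              zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.PERSISTENT);
          }

          long t0 = System.nanoTime();
          zk.getData(path, false, null);       // read: answered by the local server
          long readMs = (System.nanoTime() - t0) / 1_000_000;

          t0 = System.nanoTime();
          zk.setData(path, new byte[8], -1);   // write: proposal/commit round trips via the leader
          long writeMs = (System.nanoTime() - t0) / 1_000_000;

          System.out.println("read " + readMs + " ms, write " + writeMs + " ms");
          zk.close();
      }
  }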

Patrick



Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Martin Waite
Hi Patrick,

Thanks for your input.

I am planning on having 3 zk servers per data centre, with perhaps only 2 in
the tie-breaker site.

The traffic between zk and the applications will be lots of local reads -
"who is the primary database?".  Changes to the config will be rare (server
rebuilds, etc. - i.e., planned changes) or caused by server / network / site
failure.

The interesting thing in my mind is how zookeeper will cope with inter-site
link failure - how quickly the remote sites will notice, and how quickly
normality can be resumed when the link reappears.

I need to get this running in the lab and start pulling out wires.

regards,
Martin





Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Mahadev Konar
Hi Martin,
  The results would be really nice information to have on the ZooKeeper wiki.
It would be very helpful for others considering the same kind of deployment.
So, do send out your results on the list.


Thanks
mahadev





Re: Managing multi-site clusters with Zookeeper

2010-03-08 Thread Patrick Hunt
That's controlled by the tickTime/syncLimit/initLimit/etc. See more 
about this in the admin guide: http://bit.ly/c726DC


You'll want to increase from the defaults since those are typically for 
high performance interconnect (ie within colo). You are correct though, 
much will depend on your env. and some tuning will be involved.


Patrick



Re: Managing multi-site clusters with Zookeeper

2010-03-07 Thread Mahadev Konar
Hi Martin,
 As Ted rightly mentions, ZooKeeper is usually run within a colo because
of the low-latency requirements of the applications that it supports.

It's definitely reasonable to use it in multi-data-center environments, but
you should realize the implications of it. The high latency/low throughput
means that you should make minimal use of such a ZooKeeper ensemble.

Also, there are things like the tickTime, the syncLimit and others (setup
parameters for ZooKeeper in config) which you will need to tune a little to
get ZooKeeper running without many hiccups in this environment.

Thanks
mahadev





Re: Managing multi-site clusters with Zookeeper

2010-03-07 Thread Martin Waite
Hi Mahadev,

The inter-site links are a nuisance.  We have two data-centres with 100Mb
links which I hope would be good enough for most uses, but we need a 3rd
site - and currently that only has 2Mb links to the other sites.  This might
be a problem.

The ensemble would have a lot of read traffic from applications asking which
database to connect to for each transaction - which presumably would be
mostly handled by local zookeeper servers (do we call these "nodes" as
opposed to "znodes"?).  The write traffic would be mostly changes to
configuration (a rare event), and changes in the health of database servers
- also hopefully rare.  I suppose the main concern is how much ambient
zookeeper system chatter will cross the links.   Are there any measurements
of how much traffic is used by zookeeper in maintaining the ensemble?

Another question that occurs is whether I can link sites A, B, and C in a
ring - so that if any one site drops out, the remaining 2 continue to talk.
I suppose that if the zookeeper servers are all in direct contact with each
other, this issue does not exist.

regards,
Martin




Re: Managing multi-site clusters with Zookeeper

2010-03-07 Thread Mahadev Konar
Martin,
 A 2Mb link might certainly be a problem. We refer to these nodes as
ZooKeeper servers; "znode" is used for the data elements in the ZooKeeper data
tree.

The ZooKeeper ensemble has minimal traffic, which is basically health checks
between the members of the ensemble. We call one of the members the Leader,
who leads the ensemble, and the others Followers. The Leader does
periodic health checks to see if the Followers are doing fine. This is on
the order of 1KB/sec.

There is some traffic when leader election within the ensemble happens.
This might be on the order of 1-2KB/sec.

As you mentioned, the reads happen locally. So, a good enough link between the
ensemble members is important so that the followers can be up to date with
the Leader. But again, looking at your config, it looks like it's mostly
read-only traffic.

One more thing you should be aware of:
Let's say an ephemeral node was created and its client died; the clients
connected to the slow ZooKeeper server (with 2Mb/s links) would then lag behind
the other clients connected to the other servers.

In my opinion you should do some testing, since 2Mb/sec seems a little
dodgy.
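
If you want to measure that effect, a rough sketch (host name and znode path 
are made up) is to have a client on the slow site watch an ephemeral znode and 
log when its deletion is finally observed; running the same program against a 
fast-site server gives a comparison:

  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.Watcher.Event.EventType;
  import org.apache.zookeeper.ZooKeeper;

  public class EphemeralWatch {
      public static void main(String[] args) throws Exception {
          final long start = System.currentTimeMillis();
          ZooKeeper zk = new ZooKeeper("zk-slow-site:2181", 30000, event -> { });
          Thread.sleep(2000);   // crude: give the session time to establish

          Watcher onDelete = event -> {
              if (event.getType() == EventType.NodeDeleted) {
                  System.out.println("node gone after "
                          + (System.currentTimeMillis() - start) + " ms");
              }
          };
          // exists() checks for the node and arms a one-shot watch on it.
          if (zk.exists("/ephemeral-test", onDelete) == null) {
              System.out.println("node already gone");
          }
          Thread.sleep(Long.MAX_VALUE);   // keep the JVM alive to receive the event
      }
  }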

Thanks
mahadev
 


Re: Managing multi-site clusters with Zookeeper

2010-03-07 Thread Ted Dunning
If you can stand the latency for updates then zk should work well for 
you. It is unlikely that you will be able to do better than zk does and 
still maintain correctness.


Do note that you can probably bias the client to use a local server. 
That should make things more efficient.


Sent from my iPhone



Managing multi-site clusters with Zookeeper

2010-03-06 Thread Martin Waite
Hi,

We're attempting to build a multi-site cluster:

   1. the web tier and application tier are active in all sites
   2. only one database is active at a time - normally in the designated
   primary site

We want to use 3 sites to maintain a quorum.  So, if the Primary site loses
sight of both of the other sites, it will shut itself down.  If the
other sites both lose sight of the Primary site, they will co-operate in
electing one of the pair as the new primary, and bring up the database
services.

I am thinking that in each site, a number of sentinel processes could hold
open ephemeral znodes flagging that the site is up - with names like
site1/sentinel-1.  These sentinels could be plugged into local health
monitoring, and when the site falls into disrepair, remove themselves.  If
links between sites fail, then the ephemeral nodes would disappear too.

Each site would have a process that periodically checks the presence of the
sentinel znodes of the other sites.  If all disappear, then the site knows
it is in a minority partition, and shuts down services as required.
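
For what it's worth, a minimal sketch of one site's sentinel-plus-checker with 
the Java client (paths follow the naming above; the host, the pre-created 
/siteN parent znodes, and the polling interval are all assumptions):

  // Sketch of the scheme above: hold an ephemeral "site is up" flag and
  // periodically look for the other sites' sentinels.
  import java.util.List;
  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class SiteSentinel {
      public static void main(String[] args) throws Exception {
          ZooKeeper zk = new ZooKeeper("zk-local:2181", 30000, event -> { });
          Thread.sleep(2000);   // crude: give the session time to establish

          // Sentinel: an EPHEMERAL znode, so it vanishes if this process dies
          // or its session is lost.  Assumes /site1, /site2, /site3 already exist.
          zk.create("/site1/sentinel-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

          // Checker: if no other site has a live sentinel, assume we are in
          // the minority partition and shut local services down.
          while (true) {
              List<String> site2 = zk.getChildren("/site2", false);
              List<String> site3 = zk.getChildren("/site3", false);
              if (site2.isEmpty() && site3.isEmpty()) {
                  System.out.println("no remote sentinels visible - shutting down services");
              }
              Thread.sleep(5000);
          }
      }
  }

(In a real partition the local ZooKeeper servers may themselves lose quorum, in 
which case the reads above fail with a connection-loss error rather than 
returning empty lists - the checker would need to treat that the same way.)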

Is this a viable approach, or am I taking Zookeeper out of its application
domain and just asking for trouble ?

regards,
Martin


Re: Managing multi-site clusters with Zookeeper

2010-03-06 Thread Ted Dunning
What you describe is relatively reasonable, even though Zookeeper is not
normally distributed across multiple data centers with all members getting
full votes.  If you account for the limited throughput that this will impose
on your applications that use ZK, then I think that this can work well.
Probably, you would have local ZK clusters for higher transaction rate
applications.

You should also consider very carefully whether having multiple data centers
increases or decreases your overall reliability.  Unless you design very
carefully, this will normally substantially degrade reliability.  Making
sure that it increases reliability is a really big task that involves a lot
of surprising (it was to me) considerations and considerable hardware and
time investments.

Good luck!

On Sat, Mar 6, 2010 at 1:50 AM, Martin Waite waite@googlemail.com wrote:

 Is this a viable approach, or am I taking Zookeeper out of its application
 domain and just asking for trouble ?




-- 
Ted Dunning, CTO
DeepDyve


Re: Managing multi-site clusters with Zookeeper

2010-03-06 Thread Martin Waite
I take your point about reliability, but I have no option other than finding
a multi-site solution.

Unfortunately, in my experience sites are much less reliable than individual
machines, and so in a way coping with site failure is more important than
coping with individual machine failure.  I imagine that the risk profile
changes according to the number of machines you have, however.

Thanks for the input
Martin
