Re: Managing multi-site clusters with Zookeeper
It is a bit confusing, but initLimit is the timer used when a follower connects to a leader. There may be some state transfer involved to bring the follower up to speed, so we need to allow a little extra time for the initial connection. After that we use syncLimit to figure out if a leader or follower is dead: a peer (leader or follower) is considered dead if syncLimit ticks go by without hearing from the other machine. (This is after the initial connection has been made.) Please open a jira to make the text a bit more explicit. Feel free to add suggestions :) thanx ben

On 03/15/2010 04:17 AM, Michael Bauland wrote: Hi Patrick, I'm also setting up a ZooKeeper ensemble across three different locations and I've got some questions regarding the parameters as specified on the page you mentioned:

That's controlled by tickTime/syncLimit/initLimit/etc.; see more about this in the admin guide: http://bit.ly/c726DC

- What's the difference between initLimit and syncLimit? For initLimit it says this is the time to allow followers to connect and sync to a leader, and syncLimit is the time to allow followers to sync with ZooKeeper. To me this sounds very similar, since "ZooKeeper" in the second definition probably means the ZooKeeper leader, doesn't it?

- When I connect with a client to the ZooKeeper ensemble I provide the three IP addresses of my three ZooKeeper servers. Does the client then choose one of them arbitrarily, or will it always try to connect to the first one first? I'm asking since I would like my clients to first try to connect to the local ZooKeeper server, and only if that fails (for whatever reason, maybe it's down) try one of the servers at a different physical location.

You'll want to increase from the defaults since those are typically for high-performance interconnect (ie within colo). You are correct though, much will depend on your env. and some tuning will be involved.
Do you have any suggestions for the parameters? So far I have left tickTime at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute). Our sites are connected with 1Gbit to the Internet, but of course we have no influence on what's in between. The data managed by ZooKeeper is quite large (snapshots are 700 MByte, but they may increase in the future). Thanks for your help, Michael
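Ben's distinction between the two limits can be made concrete with a zoo.cfg fragment (a sketch only; the values and hostnames below are illustrative assumptions, not recommendations):

```properties
# Each "tick" is 2 seconds.
tickTime=2000
# A follower gets initLimit * tickTime = 60 s to connect to the leader
# and finish the initial state transfer (snapshot download included).
initLimit=30
# After the initial connection, a peer is presumed dead if
# syncLimit * tickTime = 20 s passes without hearing from it.
syncLimit=10
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```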
Re: Managing multi-site clusters with Zookeeper
Michael Bauland wrote: - When I connect with a client to the Zookeeper ensemble I provide the three IP addresses of my three Zookeeper servers. Does the client then choose one of them arbitrarily or will it always try to connect to the first one first? [...]

It explicitly randomizes the list; this is to keep all clients from connecting to the same servers in the same order, and instead distribute the load. There's an option to turn this off in the C client, but not in the Java client. However, this is intended for testing purposes, not for production use. We could add it to the Java client (create a JIRA if you like), however I'm not sure it would solve your problem. Once the local server is accessible again, there'd be nothing to cause the client to reconnect to the local server. If the issue is intermittent it could be non-optimal.

An option is to have more than one local server. So if you distribute between 3 colos then use 7 servers (3-2-2) and have the client connect to the 2 local ones. This handles local failure; however, it does not handle partitioning of the local servers from the remote ensemble members. Then again, if both local servers are unable to connect to remote servers it seems likely that the client couldn't either (network partition). Not an optimal solution either, unfortunately.

We have talked about adding this functionality to the ZK client (connect to the server with lowest latency/load/connections/etc. first). However, this is not currently implemented. There's also the issue of how to decide when to switch to another server (i.e. when the local one comes back). For the time being you may have to handle this within your own code (two possible sessions based on connect strings).
You'll want to increase from the defaults since those are typically for high performance interconnect (ie within colo). You are correct though, much will depend on your env. and some tuning will be involved.

Do you have any suggestions for the parameters? So far I left tickTime at 2 sec and increased initLimit and syncLimit to 30 (i.e., one minute). Our sites are connected with 1Gbit to the Internet, but of course we have no influence on what's in between. The data managed by zookeeper is quite large (snapshots are 700 MByte, but they may increase in the future).

What does your latency look like? Try using something like ping between your colos over a longish period of time, then look at the min/max/avg results that you see. What are you seeing? That, along with a bandwidth measurement (copy files using scp, say), will help you decide. We'd definitely be interested in your feedback as part of this process, both in terms of docs (feel free to enter jiras) and other insights you have as part of the effort. Regards, Patrick
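Patrick's closing suggestion (two possible sessions based on connect strings) can be sketched on the application side. The helper below is hypothetical and not part of the ZooKeeper API; it just splits one server list into a local-first pair of connect strings, so application code can open a session against the local servers and fall back to the remote ones:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical app-side helper: split one connect string into local and
// remote parts so the application can prefer a session against the local
// server(s). The stock Java client always randomizes the full list, so
// this preference has to live in your own code.
public class ConnectStrings {

    // Returns {localConnectString, remoteConnectString}.
    static String[] splitByLocality(String connectString, String localSuffix) {
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (String hostPort : connectString.split(",")) {
            String host = hostPort.split(":")[0];
            if (host.endsWith(localSuffix)) {
                local.add(hostPort);
            } else {
                remote.add(hostPort);
            }
        }
        return new String[] { String.join(",", local), String.join(",", remote) };
    }

    public static void main(String[] args) {
        String[] split = splitByLocality(
            "zk1.colo-a.example:2181,zk1.colo-b.example:2181,zk1.colo-c.example:2181",
            ".colo-a.example");
        System.out.println(split[0]); // zk1.colo-a.example:2181
        System.out.println(split[1]);
    }
}
```

As Patrick notes, this does not solve the "switch back when the local server returns" problem; that still needs explicit handling (for example, periodically retrying the local connect string).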
Re: Managing multi-site clusters with Zookeeper
On top of Ben's description, you probably need to set initLimit to several minutes to transfer 700 MB (worst case). The value of syncLimit, however, does not need to be that large. -Flavio

On Mar 15, 2010, at 7:24 PM, Benjamin Reed wrote: it is a bit confusing but initLimit is the timer that is used when a follower connects to a leader. [...]
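Flavio's sizing advice for a 700 MB snapshot reduces to simple arithmetic; a sketch under an assumed effective bandwidth (measure your own links, e.g. with scp, before relying on a number):

```java
// Back-of-envelope sizing for initLimit when a follower may need to
// receive a full 700 MB snapshot on (re)connect. The usable inter-site
// bandwidth below is an assumption, not a measurement.
public class InitLimitSizing {

    static int minInitLimitTicks(double snapshotMB, double usableMBytesPerSec,
                                 int tickTimeMs) {
        double transferSecs = snapshotMB / usableMBytesPerSec;
        // Round up to whole ticks.
        return (int) Math.ceil(transferSecs * 1000.0 / tickTimeMs);
    }

    public static void main(String[] args) {
        // 700 MB over an assumed 5 MB/s effective WAN link = 140 s,
        // i.e. 70 ticks at tickTime=2000 ms, before any safety margin.
        System.out.println(minInitLimitTicks(700, 5, 2000));
    }
}
```

At these numbers the thread's initLimit=30 (one minute) would not be enough for a worst-case transfer, which is consistent with Flavio's "several minutes" suggestion.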
Re: Managing multi-site clusters with Zookeeper
Hi Ted, If the links do not work for us for zk, then they are unlikely to work with any other solution, such as trying to stretch Pacemaker or Red Hat Cluster with their multicast protocols across the links. If the links are not good enough, we might have to spend some more money to fix this. regards, Martin

On 8 March 2010 02:14, Ted Dunning ted.dunn...@gmail.com wrote: If you can stand the latency for updates then zk should work well for you. It is unlikely that you will be able to do better than zk does and still maintain correctness. Do note that you can probably bias clients to use a local server. That should make things more efficient. Sent from my iPhone

On Mar 7, 2010, at 3:00 PM, Mahadev Konar maha...@yahoo-inc.com wrote: The inter-site links are a nuisance. We have two data-centres with 100Mb links which I hope would be good enough for most uses, but we need a 3rd site - and currently that only has 2Mb links to the other sites. This might be a problem.
Re: Managing multi-site clusters with Zookeeper
IMO latency is the primary issue you will face, but also keep in mind reliability w/in a colo. Say you have 3 colos (obviously can't be 2): if you only have 3 servers, one in each colo, you will be reliable, but clients w/in each colo will have to connect to a remote colo if the local server fails. You will want to prioritize the local colo, given that reads can then be serviced entirely locally. If you have 7 servers (2-2-3) that would be better: if a local server fails you have a redundant one, and if both fail then you go remote.

You want to keep your writes as few and as small as possible. Why? Say you have 100 ms latency btw colos; let's go through a scenario for a client in a colo where the local servers are not the leader (zk cluster leader).

read:
1) client reads a znode from the local server
2) local server responds in ~1 ms (usual in-colo comm)

write:
1) client writes a znode to local server A
2) A proposes the change to the ZK Leader (L) in a remote colo
3) L gets the proposal in 100 ms
4) L proposes the change to all followers
5) all followers (not exactly, but hopefully) get the proposal in 100 ms
6) followers ack the change
7) L gets the acks in 100 ms
8) L commits the change (message to all followers)
9) A gets the commit in 100 ms
10) A responds to the client (< 1 ms)

write latency: 100 + 100 + 100 + 100 = 400 ms

Obviously keeping these writes small is also critical. Patrick

Martin Waite wrote: Hi Ted, If the links do not work for us for zk, then they are unlikely to work with any other solution [...]
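The write path in Patrick's scenario is four sequential cross-colo hops, which is where the ~400 ms figure comes from; a minimal sketch of the arithmetic:

```java
// The four sequential cross-colo hops in Patrick's write scenario:
// follower -> leader (proposal), leader -> followers (proposal),
// followers -> leader (ack), leader -> followers (commit).
public class WriteLatency {

    static int writeLatencyMs(int interColoLatencyMs, int localHopMs) {
        int crossColoHops = 4;
        // Plus the client <-> local-server round trip (~1 ms each way in-colo).
        return crossColoHops * interColoLatencyMs + 2 * localHopMs;
    }

    public static void main(String[] args) {
        System.out.println(writeLatencyMs(100, 1)); // 402, i.e. ~400 ms
    }
}
```

Note the cross-colo term dominates: write latency scales with the inter-colo RTT, not with link bandwidth, so reducing write count matters more than adding bandwidth.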
Re: Managing multi-site clusters with Zookeeper
Hi Patrick, Thanks for your input. I am planning on having 3 zk servers per data centre, with perhaps only 2 in the tie-breaker site. The traffic between zk and the applications will be lots of local reads ("who is the primary database?"). Changes to the config will be rare (server rebuilds, etc., i.e. planned changes) or caused by server / network / site failure. The interesting thing in my mind is how zookeeper will cope with inter-site link failure: how quickly the remote sites will notice, and how quickly normality can be resumed when the link reappears. I need to get this running in the lab and start pulling out wires. regards, Martin

On 8 March 2010 17:39, Patrick Hunt ph...@apache.org wrote: IMO latency is the primary issue you will face, but also keep in mind reliability w/in a colo. [...]
Re: Managing multi-site clusters with Zookeeper
Hi Martin, The results would be really nice information to have on the ZooKeeper wiki, and would be very helpful for others considering the same kind of deployment. So, do send out your results on the list. Thanks mahadev

On 3/8/10 11:18 AM, Martin Waite waite@googlemail.com wrote: Hi Patrick, Thanks for your input. I am planning on having 3 zk servers per data centre, with perhaps only 2 in the tie-breaker site. [...]
Re: Managing multi-site clusters with Zookeeper
That's controlled by tickTime/syncLimit/initLimit/etc.; see more about this in the admin guide: http://bit.ly/c726DC

You'll want to increase from the defaults since those are typically for high performance interconnect (ie within colo). You are correct though, much will depend on your env. and some tuning will be involved. Patrick

Martin Waite wrote: Hi Patrick, Thanks for your input. I am planning on having 3 zk servers per data centre, with perhaps only 2 in the tie-breaker site. [...]
Re: Managing multi-site clusters with Zookeeper
Hi Martin, As Ted rightly mentions, ZooKeeper is usually run within a colo because of the low-latency requirements of the applications it supports. It's definitely reasonable to use it in a multi-data-center environment, but you should realize the implications: the high latency / low throughput mean that you should make minimal use of such a ZooKeeper ensemble. Also, there are settings like tickTime, syncLimit and others (setup parameters in the ZooKeeper config) which you will need to tune a little to get ZooKeeper running without many hiccups in this environment. Thanks mahadev

On 3/6/10 10:29 AM, Ted Dunning ted.dunn...@gmail.com wrote: What you describe is relatively reasonable, even though Zookeeper is not normally distributed across multiple data centers with all members getting full votes. [...]

On Sat, Mar 6, 2010 at 1:50 AM, Martin Waite waite@googlemail.com wrote: Is this a viable approach, or am I taking Zookeeper out of its application domain and just asking for trouble ?
Re: Managing multi-site clusters with Zookeeper
Hi Mahadev, The inter-site links are a nuisance. We have two data-centres with 100Mb links which I hope would be good enough for most uses, but we need a 3rd site - and currently that only has 2Mb links to the other sites. This might be a problem.

The ensemble would have a lot of read traffic from applications asking which database to connect to for each transaction - which presumably would be mostly handled by local zookeeper servers (do we call these "nodes", as opposed to znodes?). The write traffic would be mostly changes to configuration (a rare event), and changes in the health of database servers - also hopefully rare. I suppose the main concern is how much ambient zookeeper system chatter will cross the links. Are there any measurements of how much traffic is used by zookeeper in maintaining the ensemble?

Another question that occurs is whether I can link sites A, B, and C in a ring, so that if any one site drops out, the remaining 2 continue to talk. I suppose that if the zookeeper servers are all in direct contact with each other, this issue does not exist. regards, Martin

On 7 March 2010 21:43, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Martin, As Ted rightly mentions that ZooKeeper usually is run within a colo because of the low latency requirements of applications that it supports. [...]
Re: Managing multi-site clusters with Zookeeper
Martin, A 2Mb link might certainly be a problem. We refer to these nodes as ZooKeeper servers; "znode" is used for the data elements in the ZooKeeper data tree. The ZooKeeper ensemble has minimal background traffic, basically health checks between the members of the ensemble. We call one of the members the Leader, who leads the ensemble, and the others Followers. The Leader does periodic health checks to see if the Followers are doing fine; this is of the order of 1KB/sec. There is some traffic when leader election happens within the ensemble; this might be of the order of 1-2KB/sec. As you mentioned, the reads happen locally, so a good enough link between the ensemble members is important so that the followers can stay up to date with the Leader. But again, looking at your config, it looks like mostly read-only traffic.

One more thing you should be aware of: let's say an ephemeral node was created and the client died; then the clients connected to the slow ZooKeeper server (with 2Mb/s links) would lag behind the clients connected to the other servers. In my opinion you should do some testing, since 2Mb/sec seems a little dodgy. Thanks mahadev

On 3/7/10 2:09 PM, Martin Waite waite@googlemail.com wrote: Hi Mahadev, The inter-site links are a nuisance. We have two data-centres with 100Mb links which I hope would be good enough for most uses, but we need a 3rd site - and currently that only has 2Mb links to the other sites. This might be a problem. [...]
Re: Managing multi-site clusters with Zookeeper
If you can stand the latency for updates then zk should work well for you. It is unlikely that you will be able to do better than zk does and still maintain correctness. Do note that you can probably bias clients to use a local server. That should make things more efficient. Sent from my iPhone

On Mar 7, 2010, at 3:00 PM, Mahadev Konar maha...@yahoo-inc.com wrote: The inter-site links are a nuisance. We have two data-centres with 100Mb links which I hope would be good enough for most uses, but we need a 3rd site - and currently that only has 2Mb links to the other sites. This might be a problem.
Managing multi-site clusters with Zookeeper
Hi, We're attempting to build a multi-site cluster:

1. the web tier and application tier are active in all sites
2. only one database is active at a time, normally in the designated primary site

We want to use 3 sites to maintain a quorum. So, if the primary site loses sight of both of the other sites, it will shut itself down. If the other sites both lose sight of the primary site, they will co-operate in electing one of the pair as the new primary, and bring up the database services.

I am thinking that in each site, a number of sentinel processes could hold open ephemeral znodes flagging that the site is up, with names like site1/sentinel-1. These sentinels could be plugged into local health monitoring, and when the site falls into disrepair, remove themselves. If links between sites fail, then the ephemeral nodes would disappear too. Each site would have a process that periodically checks for the presence of the sentinel znodes of the other sites. If all disappear, then the site knows it is in a minority partition, and shuts down services as required.

Is this a viable approach, or am I taking Zookeeper out of its application domain and just asking for trouble? regards, Martin
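The "all remote sentinels gone" check in Martin's design boils down to a small decision function. The sketch below abstracts away the ZooKeeper lookups so the logic is testable without a server; in a real client the per-site sentinel lists would come from getChildren() on paths such as /sites/site1, with the sentinels themselves created as EPHEMERAL znodes. Names and paths here are illustrative assumptions:

```java
import java.util.List;
import java.util.Map;

// Sketch of the decision step in the sentinel design. In deployment,
// sentinelsBySite would be built by calling getChildren("/sites/<site>")
// for each site; ephemeral sentinel znodes vanish when their owning
// process dies or loses its session.
public class PartitionCheck {

    // sentinelsBySite: site name -> currently visible sentinel znodes.
    static boolean inMinorityPartition(Map<String, List<String>> sentinelsBySite,
                                       String localSite) {
        int visibleRemoteSites = 0;
        for (Map.Entry<String, List<String>> e : sentinelsBySite.entrySet()) {
            if (e.getKey().equals(localSite)) continue;
            if (!e.getValue().isEmpty()) visibleRemoteSites++;
        }
        // If no other site's sentinels are visible, assume this site is in
        // the minority partition and shut services down.
        return visibleRemoteSites == 0;
    }
}
```

One caveat worth noting: if the local site is truly partitioned off, its local ZooKeeper servers lose quorum and the client's session drops, so session loss itself is also a signal the checker process must handle.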
Re: Managing multi-site clusters with Zookeeper
What you describe is relatively reasonable, even though Zookeeper is not normally distributed across multiple data centers with all members getting full votes. If you account for the limited throughput that this will impose on your applications that use ZK, then I think that this can work well. Probably, you would have local ZK clusters for higher-transaction-rate applications.

You should also consider very carefully whether having multiple data centers increases or decreases your overall reliability. Unless you design very carefully, this will normally substantially degrade reliability. Making sure that it increases reliability is a really big task that involves a lot of surprising (it was to me) considerations and considerable hardware and time investments. Good luck!

On Sat, Mar 6, 2010 at 1:50 AM, Martin Waite waite@googlemail.com wrote: Is this a viable approach, or am I taking Zookeeper out of its application domain and just asking for trouble ? -- Ted Dunning, CTO DeepDyve
Re: Managing multi-site clusters with Zookeeper
I take your point about reliability, but I have no option other than finding a multi-site solution. Unfortunately, in my experience sites are much less reliable than individual machines, and so in a way coping with site failure is more important than coping with individual machine failure. I imagine that the risk profile changes according to the number of machines you have, however. Thanks for the input. Martin

On 6 March 2010 18:29, Ted Dunning ted.dunn...@gmail.com wrote: What you describe is relatively reasonable, even though Zookeeper is not normally distributed across multiple data centers with all members getting full votes. [...] -- Ted Dunning, CTO DeepDyve