[ https://issues.apache.org/jira/browse/MESOS-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522346#comment-14522346 ]
Raul Gutierrez Segales commented on MESOS-890:
----------------------------------------------

Well, thanks to [~yasumoto] we now know how to do this :-)

> Figure out a way to migrate a live Mesos cluster to a different ZooKeeper cluster
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-890
>                 URL: https://issues.apache.org/jira/browse/MESOS-890
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Raul Gutierrez Segales
>
> I've been chatting with [~vinodkone] about approaching a live ZK cluster migration. Here are the options we came up with.
>
> For the descriptions we treat `zk1` as the current working cluster, `obs` as a bunch of ZooKeeper Observers [1], and `zk2` as the new cluster to which we need to migrate.
>
> Approach #1: Using Observers
>
> With this option we need to:
> * add obs to zk1
> * restart the slaves so they use obs to find their master
> * restart the framework so it uses obs to find the Mesos master
> * restart the Mesos masters so they use obs to perform their election
> * stop all ZK obs and remove their data (since they will need to sync up with an entirely new cluster, we need to lose the old data)
> * restart the ZK obs as members of zk2
> * at this point the slaves, the framework and the masters can reach the ZK obs again and an election happens
> * optionally, restart the slaves, the framework and the masters once more to use zk2 instead of the ZK obs if you want to decommission them
>
> This assumes that we can do the last three steps in << 75 secs (75 secs being the slave health check timeout). This is a reasonable assumption if the data size in zk2 is small enough to ensure that the ZK obs can sync up quickly with zk2. If zk2 is a new cluster with no data then this should be very fast.
>
> The good things about this approach are:
> * no Mesos code change
> * it is very easy to roll back halfway through, if need be
>
> The hard issues are:
> * manipulating the ZK obs (i.e.: stopping them, removing the zk1 data and starting them again) needs to be done with care. Messing up configs or not removing the zk1 data on any of the ZK obs will cause problems
> * we need to restart all slaves to have them use the ZK obs instead of connecting to zk1 directly. But with slave recovery this isn't an issue, just an extra step
> * same thing for the framework and the masters
>
> Approach #2: Dual publishing from Mesos masters
>
> With this option we would augment the election handling code in the Mesos masters so that it deals with the notion of a primary and a secondary ZK cluster. Master registration and election would then work as follows:
> * create an ephemeral|sequential znode in zk1 (i.e.: /path/to/znode/mesos_000023)
> * create an ephemeral, but not sequential, znode in zk2 with the exact same path as what was created in zk1 (i.e.: /path/to/znode/mesos_000023)
> * make sure both sessions, in zk1 and zk2, are always in the same state (i.e.: if one expires, the other one should be closed, etc.)
>
> For now, let's omit a few implementation details which might need extra care and assume we can make this work consistently, in such a way that zk2 accurately reflects elections that happen in zk1. This means that regardless of being connected to zk1 or zk2, you always get the same master.
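>
> A minimal sketch of what the dual publishing could look like, assuming a kazoo-style Python client (the host lists, the election path and the listener wiring below are illustrative, not actual Mesos code):
>
>     from kazoo.client import KazooClient
>     from kazoo.protocol.states import KazooState
>
>     ZK1_HOSTS = "zk1-a:2181,zk1-b:2181,zk1-c:2181"  # current cluster (illustrative)
>     ZK2_HOSTS = "zk2-a:2181,zk2-b:2181,zk2-c:2181"  # new cluster (illustrative)
>     ELECTION_PATH = "/path/to/znode"
>
>     zk1 = KazooClient(hosts=ZK1_HOSTS)
>     zk2 = KazooClient(hosts=ZK2_HOSTS)
>     zk1.start()
>     zk2.start()
>
>     # 1. ephemeral|sequential znode in zk1, e.g. /path/to/znode/mesos_000023
>     real_path = zk1.create(ELECTION_PATH + "/mesos_", b"master-info",
>                            ephemeral=True, sequence=True, makepath=True)
>
>     # 2. ephemeral (non-sequential) znode in zk2 with the exact same path
>     zk2.ensure_path(ELECTION_PATH)
>     zk2.create(real_path, b"master-info", ephemeral=True)
>
>     # 3. keep both sessions in the same state: if either session is lost,
>     #    close the other one so zk1 and zk2 never disagree on which masters
>     #    are registered. (Real code would hand this off to another thread
>     #    rather than doing work directly in the connection listener.)
>     def close_peer_on_loss(peer):
>         def listener(state):
>             if state == KazooState.LOST:
>                 peer.stop()
>         return listener
>
>     zk1.add_listener(close_peer_on_loss(zk2))
>     zk2.add_listener(close_peer_on_loss(zk1))
>
> With the sessions mirrored like that, a client looking at either cluster sees the same set of registered masters, which is what the migration steps below rely on.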
> Once we have this, the migration steps would be:
> * restart the slaves so they use zk2, where the masters can be found by virtue of what we implemented above
> * restart the framework so that it finds the Mesos master in zk2
> * stop all Mesos masters (they all need to be stopped before moving to the next step)
> * start all Mesos masters using zk2 as their primary and only cluster
>
> Again, this assumes we can do the last two steps in << 75 secs (or, if we needed to, we could bump the slave health check timeout), which sounds achievable given that masters have no state and their start-up time is very short.
>
> The good things about this approach are:
> - no tinkering with extra ZK servers nor with ZK configs
>
> The hard issues are:
> - extra code needs to be added to the election handling bits of the Mesos master to address a rare, but real, use case of cluster migration. It might take a bit of time to get that code right
> - it's easier to end up in a bad state if any of the Mesos masters has a bad config, or is restarted earlier and ends up publishing differently than the other masters. This could lead to elections with differing results
>
> Thoughts?
>
> [1] http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)