We have live-migrated an entire cluster of tens of thousands of Mesos agents to point at a new ZK ensemble <https://youtu.be/nNrh-gdu9m4?t=21m50s> without killing our cluster or the tasks we were running. (Twice.)
We started by shutting off all of the Mesos masters. I’ve heard rumors that some people have found their Mesos agents will kill themselves without a master, but this has never been my experience. If you find this to be the case, please reach out, as I’d love to avoid that fate at all costs.

Once the masters were down, we submitted a change to modify the configuration for the agents (we set up an automatic restart of the slave for configuration values such as this one, to make rollouts easier). It took our configuration management system the better part of an hour to propagate the change across the cluster, but while that was happening the agents were happily running and user tasks were serving traffic.

Once we saw zk_watch_count (reported by the mntr command <https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands>) increase to the expected number of agents on the new ensemble, we turned the masters back on (now also pointing at the new ensemble) and the agents sent status updates back to them.

If you haven’t taken a look at zktraffic <https://github.com/twitter/zktraffic>, I’d recommend it for improved visibility into your ensemble as well.

Please note: there’s a bug in the C ZooKeeper library <https://issues.apache.org/jira/browse/ZOOKEEPER-1998> where the members of an ensemble are only resolved once. There isn’t conclusive proof that it affects the agents <https://issues.apache.org/jira/browse/MESOS-2681>. We were fine, but you may want to validate.

> On Nov 10, 2015, at 11:23 AM, Donald Laidlaw <donlaid...@me.com> wrote:
>
> I agree, you want to apply the changes gradually so as not to lose a quorum.
> The problem is automating this so that it happens in a lights-out
> environment, in the cloud, without some poor slob's pager going off in the
> middle of the night :)
>
> While health checks can detect and replace a dead server reliably on any
> number of clouds, the new server comes up with a new IP address.
> This server
> can reliably join the ZooKeeper ensemble. However, it is tough to automate
> the rolling restart of the other Mesos servers, both masters and slaves, that
> needs to occur to keep them happy.
>
> One thing I have not tried is to just ignore the change, and use something to
> detect the masters just prior to starting Mesos. If they truly fail fast,
> then if they lose a ZooKeeper connection, maybe they don't care that
> they have been started with an out-of-date list of ZooKeeper servers.
>
> What do mesos-master and mesos-slave do with a list of ZooKeeper servers to
> connect to? Just try them in order until one works, then use that one until
> it fails? If so, and it fails fast, then letting it continue to run with a
> stale list will have no ill effects. Or does it keep trying the servers in
> the list when a connection fails?
>
> Don Laidlaw
>
>
>> On Nov 10, 2015, at 4:42 AM, Erik Weathers <eweath...@groupon.com> wrote:
>>
>> Keep in mind that Mesos is designed to "fail fast". So when there are
>> problems (such as losing connectivity to the resolved ZooKeeper IP), the
>> daemon(s) (master & slave) die.
>>
>> Due to this design, we are all supposed to run the Mesos daemons under
>> "supervision", which means auto-restart after they crash. This can be done
>> with monit/god/runit/etc.
>>
>> So, to perform maintenance on ZooKeeper, I would first ensure the
>> mesos-master processes are running under "supervision" so that they restart
>> quickly after a ZK connectivity failure occurs. Then proceed with standard
>> ZooKeeper maintenance (Exhibitor-based or manual), pausing between downing
>> ZK servers to ensure you have "enough" mesos-master processes running.
>> (I *would* say "pause until you have a quorum of mesos-masters up", but
>> if you only have 2 of 3 up and then take down the ZK server the leader is
>> connected to, that would be temporarily bad. So I'd make sure they're all
>> up.)
>>
>> - Erik
>>
>> On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <ma...@mesosphere.io> wrote:
>>
>> The way I would do it in a production cluster would be *not* to use
>> IP addresses directly for the ZK ensemble, but instead rely on some form of
>> internal DNS and use internally-resolvable hostnames (e.g., {zk1, zk2,
>> ...}.prod.example.com etc.) and have the provisioning tooling (Chef, Puppet,
>> Ansible, what have you) handle setting the hostname when restarting/replacing
>> a failing/crashed ZK server.
>>
>> This way your list of ZKs given to Mesos never changes, even though the FQDNs
>> will map to different IPs / VMs.
>>
>> Obviously, this may not always be desirable / feasible (e.g., if your prod
>> environment does not support DNS resolution).
>>
>> You are correct that Mesos does not currently support dynamically changing
>> the ZK addresses, but I don't know whether that's a limitation of the Mesos
>> code or of the ZK C++ client driver.
>> I'll look into it and let you know what I find (if anything).
>>
>> --
>> Marco Massenzio
>> Distributed Systems Engineer
>> http://codetrips.com
>>
>> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaid...@me.com> wrote:
>>
>> How do Mesos masters and slaves react to ZooKeeper cluster changes? When the
>> masters and slaves start, they are given a set of addresses to connect to
>> ZooKeeper. But over time one of those ZooKeepers fails and is replaced by a
>> new server at a new address. How should this be handled in the Mesos servers?
>>
>> I am guessing that Mesos does not automatically detect and react to that
>> change. But obviously we should do something to keep the Mesos servers happy
>> as well. What should we do?
>>
>> The obvious thing is to stop the Mesos servers, one at a time, and restart
>> them with the new configuration.
>> But it would be really nice to be able to
>> do this dynamically without restarting the server. After all, coordinating a
>> rolling restart is a fairly hard job.
>>
>> Any suggestions or pointers?
>>
>> Best regards,
>> Don Laidlaw
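Erik's "run under supervision" advice maps onto any supervisor (monit/god/runit/etc.); as one illustration, a systemd unit with Restart=always does the job. The paths, the zk:// URL, and the quorum value below are assumptions for the sketch, not from the thread:

```ini
[Unit]
Description=Mesos master (fail-fast; supervisor restarts it after ZK hiccups)
After=network-online.target

[Service]
# Hypothetical paths and hostnames; adjust to your installation.
ExecStart=/usr/sbin/mesos-master \
  --zk=zk://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/mesos \
  --quorum=2 \
  --work_dir=/var/lib/mesos
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

As Erik notes, a fast automatic restart is what makes rolling ZK maintenance tolerable: each master dies on ZK connectivity loss and is back within seconds.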
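For anyone scripting the zk_watch_count check described at the top of the thread: mntr emits one key/value pair per line, so the count is easy to extract. A minimal sketch; in real use you would pipe `echo mntr | nc <zk-host> 2181` into the awk instead of the canned sample used here (the hostname and the numbers are illustrative, not from the thread):

```shell
# Sample mntr output for illustration; in real use:
#   echo mntr | nc zk1.example.com 2181   (hostname is a placeholder)
sample='zk_version	3.4.6-1569965
zk_avg_latency	0
zk_watch_count	31872
zk_znode_count	21050'

# Extract the watch count, to compare against your expected agent count.
watch_count=$(printf '%s\n' "$sample" | awk '$1 == "zk_watch_count" {print $2}')
echo "zk_watch_count=$watch_count"
```

Once the count on the new ensemble reaches roughly the number of agents you expect, it should be safe to bring the masters back up, per the procedure above.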
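Marco's stable-DNS approach means the --zk string the daemons see never changes; only the A records behind it do. A small sketch of a pre-start sanity check, with all hostnames hypothetical (and note the ZOOKEEPER-1998 caveat above: the C client may resolve these names only once per process lifetime):

```shell
# The connection string stays constant across ZK node replacements.
ZK="zk://zk1.prod.example.com:2181,zk2.prod.example.com:2181,zk3.prod.example.com:2181/mesos"

# Before (re)starting a daemon, verify each FQDN currently resolves;
# the provisioning tooling is responsible for keeping these records fresh.
for host in zk1.prod.example.com zk2.prod.example.com zk3.prod.example.com; do
  getent hosts "$host" >/dev/null || echo "warning: $host does not resolve"
done
echo "would start: mesos-master --zk=$ZK"
```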