Keep in mind that mesos is designed to "fail fast". So when there are problems (such as losing connectivity to the resolved ZooKeeper IP) the daemon(s) (master & slave) die.
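
Since a fail-fast daemon simply exits, something outside it has to bring it back. Purely as an illustration of what that "something" does, here is a minimal restart loop in Python. The names `supervise` and `DEMO_CMD`, the parameters, and the example mesos-master command line in the comment are my own placeholders, so adapt them to your setup. A real supervisor would loop forever and also deal with logging and signals.

```python
import subprocess
import sys
import time

# Stand-in for a daemon that dies immediately (hypothetical demo command).
# In real use this would be your daemon's command line, for example:
#   ["mesos-master", "--zk=zk://zk1.prod.example.com:2181/mesos",
#    "--quorum=2", "--work_dir=/var/lib/mesos"]
DEMO_CMD = [sys.executable, "-c", "raise SystemExit(1)"]


def supervise(cmd, max_restarts=5, backoff_s=1.0):
    """Run cmd and restart it each time it exits, up to max_restarts times.

    Returns the list of exit codes observed (first run plus restarts).
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        if attempt:
            # Short pause so a crash loop does not spin the CPU.
            time.sleep(backoff_s)
        exit_codes.append(subprocess.run(cmd).returncode)
    return exit_codes
```

The restart cap is only there so the sketch terminates; the point is that a crash turns into a short outage instead of a permanent one.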
Due to this design, we are all supposed to run the mesos daemons under "supervision", which means auto-restart after they crash. This can be done with monit/god/runit/etc.

So, to perform maintenance on ZooKeeper, I would first ensure the mesos-master processes are running under "supervision" so that they restart quickly after a ZK connectivity failure occurs. Then proceed with standard ZooKeeper maintenance (exhibitor-based or manual), pausing between downing of ZK servers to ensure you have "enough" mesos-master processes running. (I *would* say "pause until you have a quorum of mesos-masters up", but if you only have 2 of 3 up and then take down the ZK that the leader is connected to, that would be temporarily bad. So I'd make sure they're all up.)

- Erik

On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <ma...@mesosphere.io> wrote:

> The way I would do it in a production cluster would be *not* to use
> IP addresses directly for the ZK ensemble, but instead rely on some form of
> internal DNS and use internally-resolvable hostnames (eg,
> {zk1, zk2, ...}.prod.example.com etc) and have the provisioning tooling
> (Chef, Puppet, Ansible, what have you) handle the setting of the hostname
> when restarting/replacing a failing/crashed ZK server.
>
> This way your list of ZKs given to Mesos never changes, even though the
> FQDNs will map to different IPs / VMs.
>
> Obviously, this may not always be desirable / feasible (eg, if your prod
> environment does not support DNS resolution).
>
> You are correct in that Mesos does not currently support dynamically
> changing the ZK addresses, but I don't know whether that's a limitation
> of Mesos code or of the ZK C++ client driver.
> I'll look into it and let you know what I find (if anything).
>
> --
> *Marco Massenzio*
> Distributed Systems Engineer
> http://codetrips.com
>
> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaid...@me.com> wrote:
>
>> How do mesos masters and slaves react to zookeeper cluster changes? When
>> the masters and slaves start they are given a set of addresses to connect
>> to zookeeper. But over time, one of those zookeepers fails, and is replaced
>> by a new server at a new address. How should this be handled in the mesos
>> servers?
>>
>> I am guessing that mesos does not automatically detect and react to that
>> change. But obviously we should do something to keep the mesos servers
>> happy as well. What should we do?
>>
>> The obvious thing is to stop the mesos servers, one at a time, and
>> restart them with the new configuration. But it would be really nice to be
>> able to do this dynamically without restarting the server. After all,
>> coordinating a rolling restart is a fairly hard job.
>>
>> Any suggestions or pointers?
>>
>> Best regards,
>> Don Laidlaw
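
On the coordination part: the "make sure they're all up" pause between downing ZK servers can be scripted. Below is a rough sketch in Python, not a tested tool. The function names `tcp_alive` and `wait_for_all_masters` and all parameters are made up for illustration, and a plain TCP connect to port 5050 (the default mesos-master port) is only an assumed stand-in for whatever health check you actually trust.

```python
import socket
import time


def tcp_alive(host, port=5050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Assumption: "port is accepting connections" is a good enough
    liveness signal; swap in a stricter check if you have one.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_all_masters(hosts, is_up=tcp_alive, deadline_s=300.0,
                         poll_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until every host passes is_up, or until the deadline expires.

    Returns the list of hosts still down; an empty list means it is
    reasonable to move on to the next ZK server.
    """
    give_up_at = clock() + deadline_s
    while True:
        down = [h for h in hosts if not is_up(h)]
        if not down or clock() >= give_up_at:
            return down
        sleep(poll_s)
```

You would call `wait_for_all_masters([...your master hostnames...])` after each ZK server comes back, and stop the maintenance if it returns a non-empty list. It waits for all masters rather than a quorum, for the reason given in the procedure above.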