Keep in mind that Mesos is designed to "fail fast".  So when there are
problems (such as losing connectivity to the resolved ZooKeeper IP), the
daemons (master and slave) exit.

Due to this design, we are all supposed to run the Mesos daemons under
"supervision", which means they are restarted automatically after they
crash.  This can be done with monit/god/runit/etc.
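
To make "supervision" concrete, here is a minimal sketch in Python of a
restart loop.  It is only an illustration and not a substitute for
monit/god/runit; the function name `supervise`, its parameters, and the
launch cap are assumptions made up for this example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_launches=None, delay=1.0):
    """Run cmd, and start it again every time it exits.

    Hypothetical helper for illustration only.  max_launches caps the
    number of launches (None means restart forever); delay is the pause
    in seconds after each exit.  Returns the exit codes observed.
    """
    exit_codes = []
    while max_launches is None or len(exit_codes) < max_launches:
        result = subprocess.run(cmd)      # blocks until the process exits
        exit_codes.append(result.returncode)
        time.sleep(delay)                 # avoid a tight crash loop
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a fail-fast daemon: a child process that exits with 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, max_launches=3, delay=0.1))
```

A real supervisor also handles logging, back-off, and signals, which is
why a dedicated tool is the better choice in production.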

So, to perform maintenance on ZooKeeper, I would first ensure the
mesos-master processes are running under "supervision" so that they restart
quickly after a ZK connectivity failure occurs.  Then proceed with standard
ZooKeeper maintenance (Exhibitor-based or manual), pausing after taking down
each ZK server to ensure you have "enough" mesos-master processes running.
(I *would* say "pause until you have a quorum of mesos-masters up",
but if you only have 2 of 3 up and then take down the ZK that the leader is
connected to, that would be temporarily bad.  So I'd make sure they're all
up.)
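
The procedure above could be sketched roughly like this in Python.  All
names here (`wait_for_all`, `rolling_zk_maintenance`, the `service_zk` and
`master_is_up` callables, and the default intervals) are assumptions for
the example; how you actually service a ZK server and probe a master is up
to your own tooling.

```python
import time


def wait_for_all(masters, master_is_up, poll_interval=5.0, timeout=300.0):
    """Block until every master reports up, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        down = [m for m in masters if not master_is_up(m)]
        if not down:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("masters still down: " + ", ".join(down))
        time.sleep(poll_interval)


def rolling_zk_maintenance(zk_servers, masters, service_zk, master_is_up,
                           poll_interval=5.0, timeout=300.0):
    """Service ZK servers one at a time, waiting for ALL masters each time.

    zk_servers   - ZooKeeper hosts, in the order to service them
    masters      - mesos-master hosts
    service_zk   - callable(host) performing maintenance on one ZK server
    master_is_up - callable(host) -> bool (hypothetical health probe)
    """
    for zk in zk_servers:
        # Require every master up (not just a quorum) before touching ZK.
        wait_for_all(masters, master_is_up, poll_interval, timeout)
        service_zk(zk)
    # Confirm the masters recovered after the last ZK server as well.
    wait_for_all(masters, master_is_up, poll_interval, timeout)
```

The check deliberately waits for all masters rather than a quorum, for the
reason given above: with only 2 of 3 up, losing the ZK server the leader is
connected to would leave you temporarily without enough masters.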

- Erik

On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <ma...@mesosphere.io>
wrote:

> The way I would do it in a production cluster would be *not* to use
> IP addresses directly for the ZK ensemble, but instead rely on some form of
> internal DNS and use internally-resolvable hostnames (e.g., {zk1, zk2, ...}.
> prod.example.com etc.) and have the provisioning tooling (Chef, Puppet,
> Ansible, what have you) handle the setting of the hostname when
> restarting/replacing a failing/crashed ZK server.
>
> This way the list of ZK servers you give to Mesos never changes, even
> though the FQDNs will map to different IPs / VMs.
>
> Obviously, this may not always be desirable / feasible (e.g., if your prod
> environment does not support DNS resolution).
>
> You are correct in that Mesos does not currently support dynamically
> changing the ZK's addresses, but I don't know whether that's a limitation
> of Mesos code or of the ZK C++ client driver.
> I'll look into it and let you know what I find (if anything).
>
> --
> *Marco Massenzio*
> Distributed Systems Engineer
> http://codetrips.com
>
> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaid...@me.com> wrote:
>
>> How do Mesos masters and slaves react to ZooKeeper cluster changes? When
>> the masters and slaves start they are given a set of addresses to connect
>> to ZooKeeper. But over time, one of those ZooKeeper servers fails, and is
>> replaced by a new server at a new address. How should this be handled in
>> the Mesos servers?
>>
>> I am guessing that Mesos does not automatically detect and react to that
>> change. But obviously we should do something to keep the Mesos servers
>> happy as well. What should we do?
>>
>> The obvious thing is to stop the mesos servers, one at a time, and
>> restart them with the new configuration. But it would be really nice to be
>> able to do this dynamically without restarting the server. After all,
>> coordinating a rolling restart is a fairly hard job.
>>
>> Any suggestions or pointers?
>>
>> Best regards,
>> Don Laidlaw
>>
>>
>>
>
