Ah thanks. Why does it take 2 minutes to change configuration? Are you doing it manually? Scary :)
On Mon, Dec 1, 2014 at 10:43 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:

> Hello
>
> The idea is to be able to tune the mesos slave configuration (attributes, resource offers, general options, ... upgrades?) without altering the current tasks running on this mesos slave (a dockerized jenkins instance + docker jenkins slaves, for example).
>
> I am setting up a test cluster with the latest mesos/marathon releases to check whether the behaviors are identical.
>
>
> 2014-12-01 19:28 GMT+01:00 Benjamin Mahler <benjamin.mah...@gmail.com>:
>
>> > I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks.
>>
>> What are you trying to do? If your maintenance operation does not affect the tasks, why do you need to stop the slave in the first place?
>>
>> On Wed, Nov 26, 2014 at 1:36 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:
>>
>>> Hello all
>>>
>>> Thanks for your answers.
>>>
>>> Is there a way of configuring this 75s timeout for slave reconnection?
>>>
>>> I think my problem is that, as the task status is lost:
>>> - the marathon framework detects the loss and starts another instance
>>> - the mesos-slave, when restarting, detects the lost task and restarts a new one
>>>
>>> ==> 2 tasks on the mesos cluster, 2 running docker containers, 1 app instance in marathon
>>>
>>>
>>> So a solution would be to extend the 75s timeout. I thought my command lines for starting the cluster were fine, but it seems they are incomplete...
>>>
>>> I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks.
>>>
>>> 2014-11-25 18:30 GMT+01:00 Connor Doyle <con...@mesosphere.io>:
>>>
>>>> Hi Geoffroy,
>>>>
>>>> For the Marathon instances: in all released versions of Marathon you must supply the --checkpoint flag to turn on task checkpointing for the framework. We've changed the default to true starting with the next release.
>>>>
>>>> There is a bug in Mesos where the FrameworkInfo does not get updated when a framework re-registers. This means that if you shut down Marathon and restart it with --checkpoint, the Mesos master (with the same FrameworkId, which Marathon picks up from ZK) will ignore the new setting. For reference, here is the design doc to address that:
>>>> https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info
>>>>
>>>> Fortunately, there is an easy workaround.
>>>>
>>>> 1) Shut down Marathon (tasks keep running)
>>>> 2) Restart the leading Mesos master (tasks keep running)
>>>> 3) Start Marathon with --checkpoint enabled
>>>>
>>>> This works by clearing the Mesos master's in-memory state. It is rebuilt as the slave nodes and frameworks re-register.
>>>>
>>>> Please report back if this doesn't solve the issue for you.
>>>> --
>>>> Connor
>>>>
>>>>
>>>> > On Nov 25, 2014, at 07:43, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:
>>>> >
>>>> > Hello
>>>> >
>>>> > I am currently trying to activate checkpointing for my Mesos cloud.
>>>> >
>>>> > Starting from an application running in a docker container on the cluster, launched from marathon, my use cases are the following:
>>>> >
>>>> > UC1: kill the marathon service, then restart it after 2 minutes.
>>>> > Expected: the mesos task is still active, the docker container is running. When the marathon service restarts, it gets back its tasks.
>>>> >
>>>> > Result: OK
>>>> >
>>>> >
>>>> > UC2: kill the mesos slave, then restart it after 2 minutes.
>>>> > Expected: the mesos task remains active, the docker container is running. When the mesos slave service restarts, it gets back its tasks. Marathon does not show an error.
>>>> >
>>>> > Results: the task gets status LOST when the slave is killed. The docker container is still running. Marathon detects that the application went down and spawns a new one on another available mesos slave. When the slave restarts, it kills the previously running container and starts a new one. So I end up with 2 applications on my cluster, one spawned by Marathon and another orphan one.
>>>> >
>>>> >
>>>> > Is this behavior normal? Can you please explain what I am doing wrong?
>>>> >
>>>> > -----------------------------------------------------------------------------------------------------------
>>>> >
>>>> > Here is the configuration I have so far:
>>>> > Mesos 0.19.1 (not dockerized)
>>>> > Marathon 0.6.1 (not dockerized)
>>>> > Docker 1.3 + Deimos 0.4.2
>>>> >
>>>> > Mesos master is started with:
>>>> > /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... --quorum=1 --work_dir=/var/lib/mesos
>>>> >
>>>> > Mesos slave is started with:
>>>> > /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos --log_dir=/var/log/mesos --checkpoint=true --containerizer_path=/usr/local/bin/deimos --executor_registration_timeout=5mins --hostname=... --ip=... --isolation=external --recover=reconnect --recovery_timeout=120mins --strict=true
>>>> >
>>>> > Marathon is started with:
>>>> > java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 30000 --hostname ... --event_subscriber http_callback --http_port 8080 --task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint
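For reference, a minimal sketch of Connor's three-step workaround as shell commands. The service names (mesos-master, marathon) and the use of the "service" wrapper are assumptions about a stock package install with init scripts; adjust them to however your hosts actually manage these daemons, and make sure --checkpoint ends up in Marathon's startup flags before step 3:

    # 1) Stop Marathon -- its running tasks are left untouched
    sudo service marathon stop

    # 2) Restart the leading Mesos master to drop its cached FrameworkInfo
    sudo service mesos-master restart

    # 3) Bring Marathon back with checkpointing enabled
    #    (add --checkpoint to its flags/config before starting)
    sudo service marathon start

The master restart only clears its in-memory state; that state is rebuilt as the slaves and frameworks re-register, so running tasks are not affected.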