Ah thanks. Why does it take 2 minutes to change configuration? Are you doing it manually? Scary :)
On Mon, Dec 1, 2014 at 10:43 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:

> Hello
>
> The idea is to be able to tune the mesos slave configuration (attributes, resource offers, general options, ... upgrades?) without altering the current tasks running on this mesos slave (a dockerized jenkins instance + docker jenkins slaves, for example).
>
> I am setting up a test cluster with the latest mesos/marathon releases to check whether the behaviors are identical.
>
>
> 2014-12-01 19:28 GMT+01:00 Benjamin Mahler <benjamin.mah...@gmail.com>:
>
>> > I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks.
>>
>> What are you trying to do? If your maintenance operation does not affect the tasks, why do you need to stop the slave in the first place?
>>
>> On Wed, Nov 26, 2014 at 1:36 AM, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:
>>
>>> Hello all
>>>
>>> Thanks for your answers.
>>>
>>> Is there a way of configuring this 75s timeout for slave reconnection?
>>>
>>> I think my problem is that, as the task status is lost:
>>> - the marathon framework detects the loss and starts another instance
>>> - the mesos-slave, when restarting, detects the lost task and restarts a new one
>>>
>>> ==> 2 tasks on the mesos cluster, 2 running docker containers, 1 app instance in marathon
>>>
>>>
>>> So a solution would be to extend the 75s timeout. I thought my command lines for starting the cluster were fine, but it seems they are incomplete...
>>>
>>> I would like to be able to shut down a mesos-slave for maintenance without altering the current tasks.
>>>
>>> 2014-11-25 18:30 GMT+01:00 Connor Doyle <con...@mesosphere.io>:
>>>
>>>> Hi Geoffroy,
>>>>
>>>> For the Marathon instances: in all released versions of Marathon you must supply the --checkpoint flag to turn on task checkpointing for the framework. We've changed the default to true starting with the next release.
>>>>
>>>> There is a bug in Mesos where the FrameworkInfo does not get updated when a framework re-registers. This means that if you shut down Marathon and restart it with --checkpoint, the Mesos master (with the same FrameworkId, which Marathon picks up from ZK) will ignore the new setting. For reference, here is the design doc to address that:
>>>> https://cwiki.apache.org/confluence/display/MESOS/Design+doc%3A+Updating+Framework+Info
>>>>
>>>> Fortunately, there is an easy workaround.
>>>>
>>>> 1) Shut down Marathon (tasks keep running)
>>>> 2) Restart the leading Mesos master (tasks keep running)
>>>> 3) Start Marathon with --checkpoint enabled
>>>>
>>>> This works by clearing the Mesos master's in-memory state. It is rebuilt as the slave nodes and frameworks re-register.
>>>>
>>>> Please report back if this doesn't solve the issue for you.
>>>> --
>>>> Connor
>>>>
>>>>
>>>> > On Nov 25, 2014, at 07:43, Geoffroy Jabouley <geoffroy.jabou...@gmail.com> wrote:
>>>> >
>>>> > Hello
>>>> >
>>>> > I am currently trying to activate checkpointing for my Mesos cloud.
>>>> >
>>>> > Starting from an application running in a docker container on the cluster, launched from marathon, my use cases are the following:
>>>> >
>>>> > UC1: kill the marathon service, then restart it after 2 minutes.
>>>> > Expected: the mesos task is still active, the docker container is running. When the marathon service restarts, it gets back its tasks.
>>>> >
>>>> > Result: OK
>>>> >
>>>> >
>>>> > UC2: kill the mesos slave, then restart it after 2 minutes.
>>>> > Expected: the mesos task remains active, the docker container is running. When the mesos slave service restarts, it gets back its tasks. Marathon does not show an error.
>>>> >
>>>> > Results: the task gets status LOST when the slave is killed. The docker container is still running. Marathon detects that the application went down and spawns a new one on another available mesos slave. When the slave restarts, it kills the previously running container and starts a new one. So I end up with 2 applications on my cluster, one spawned by Marathon and another orphan one.
>>>> >
>>>> >
>>>> > Is this behavior normal? Can you please explain what I am doing wrong?
>>>> >
>>>> > -----------------------------------------------------------------------------------------------------------
>>>> >
>>>> > Here is the configuration I have so far:
>>>> > Mesos 0.19.1 (not dockerized)
>>>> > Marathon 0.6.1 (not dockerized)
>>>> > Docker 1.3 + Deimos 0.4.2
>>>> >
>>>> > Mesos master is started with:
>>>> > /usr/local/sbin/mesos-master --zk=zk://...:2181/mesos --port=5050 --log_dir=/var/log/mesos --cluster=CLUSTER_POC --hostname=... --ip=... --quorum=1 --work_dir=/var/lib/mesos
>>>> >
>>>> > Mesos slave is started with:
>>>> > /usr/local/sbin/mesos-slave --master=zk://...:2181/mesos --log_dir=/var/log/mesos --checkpoint=true --containerizer_path=/usr/local/bin/deimos --executor_registration_timeout=5mins --hostname=... --ip=... --isolation=external --recover=reconnect --recovery_timeout=120mins --strict=true
>>>> >
>>>> > Marathon is started with:
>>>> > java -Xmx512m -Djava.library.path=/usr/local/lib -Djava.util.logging.SimpleFormatter.format=%2$s %5$s%6$s%n -cp /usr/local/bin/marathon mesosphere.marathon.Main --zk zk://...:2181/marathon --master zk://...:2181/mesos --local_port_min 30000 --hostname ... --event_subscriber http_callback --http_port 8080 --task_launch_timeout 300000 --local_port_max 40000 --ha --checkpoint
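For reference, a minimal sketch of Connor's three-step workaround as shell commands. The service names (mesos-master, marathon) and the use of the "service" wrapper are assumptions about a stock package install with init scripts; adjust them to however your hosts actually manage these daemons, and make sure --checkpoint ends up in Marathon's startup flags before step 3:

    # 1) Stop Marathon -- its running tasks are left untouched
    sudo service marathon stop

    # 2) Restart the leading Mesos master to drop its cached FrameworkInfo
    sudo service mesos-master restart

    # 3) Bring Marathon back with checkpointing enabled
    #    (add --checkpoint to its flags/config before starting)
    sudo service marathon start

The master restart only clears its in-memory state; that state is rebuilt as the slaves and frameworks re-register, so running tasks are not affected.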