I'd like to initiate a discussion on the following issue:

https://issues.apache.org/jira/browse/MESOS-4581

I've included a lot of detail in the JIRA, and would rather not reiterate _all_ 
of it here on the list, but in short:

We are experiencing an issue when launching Docker containers from Marathon on
Mesos: the container actually starts on the slave node to which it's assigned,
but Mesos and Marathon remain stuck in staging and staged, respectively, until
the task launch times out and the system tries to launch it elsewhere. The
issue is intermittent: tasks start successfully about 40-50% of the time and
get stuck the rest of the time.

We've been able to narrow this down to a possible race condition, likely in
Docker itself but triggered by the mesos-docker-executor. I have written and
tested a patch in our environment that seems to have eliminated the issue;
however, I feel the patch could be made more robust, and it is currently just
a workaround.

Thanks for your time and consideration of the issue.

Travis
