Hi Travis,

Thanks for narrowing down the issue. I had a brief look at your patch, and it looks like it relies on adding a delay before inspect is called. Although that might work most of the time, I am wondering whether it is the right solution. It would be better to put a timeout on the inspect call (using ‘after’ on the future) and retry inspect when the timeout fires. We will have to discard the inspect future that's in flight.
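
To illustrate the timeout-and-retry idea, here is a minimal sketch. It uses Python's asyncio rather than libprocess, so it is not the Mesos code and not a proposed patch. The names `retry_with_timeout` and `fake_inspect` are hypothetical, and `fake_inspect` only simulates an inspect call that hangs on its first attempt.

```python
import asyncio


async def retry_with_timeout(make_call, timeout, max_attempts):
    """Await make_call(); if it times out, drop that attempt and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the pending call when the timeout expires,
            # which stands in for discarding the in-flight inspect future.
            return await asyncio.wait_for(make_call(), timeout)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # give up after the last attempt


# Hypothetical stand-in for the inspect call: the first attempt hangs,
# later attempts return immediately.
calls = {"count": 0}


async def fake_inspect():
    calls["count"] += 1
    if calls["count"] == 1:
        await asyncio.sleep(10)  # simulate an inspect that never returns in time
    return {"State": "running"}


result = asyncio.run(
    retry_with_timeout(fake_inspect, timeout=0.1, max_attempts=3)
)
```

Unlike a fixed delay before the first inspect, this only costs extra time when an attempt actually stalls, and the stalled attempt is cancelled rather than left running alongside the retry.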

-Jojy

> On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
>
> I'd like to initiate a discussion on the following issue:
>
> https://issues.apache.org/jira/browse/MESOS-4581
>
> I've included a lot of detail in the JIRA, and would rather not reiterate
> _all_ of it here on the list, but in short:
>
> We are experiencing an issue when launching docker containers from marathon
> on mesos, where the container actually starts on the slave node to which it's
> assigned, but mesos/marathon get stuck in staging/staged respectively until
> the task launch times out and system tries again to launch it elsewhere. This
> issue is random in nature, successfully starting tasks about 40-50% of the
> time, while the rest of the time getting stuck.
>
> We've been able to narrow this down to a possible race condition likely in
> docker itself, but being triggered by the mesos-docker-executor. I have
> written and tested a patch in our environment which seems to have eliminated
> the issue, however I feel that the patch could be made more robust, and is
> currently just a work-around.
>
> Thanks for your time and consideration of the issue.
>
> Travis