Thanks Jojy. I agree that my current patch is not the ideal solution; it was intended as a proof of concept to confirm that it was in fact the timing causing the problems I've been experiencing. Perhaps I didn't make that clear enough in the JIRA. That is also the reason I haven't attempted to submit that patch back.
I was hoping to open a dialog to discuss the most appropriate fix for that error, as I am relatively new to the future/promise paradigm and haven't done much C++ in ten years. I'm assuming that when you say "using 'after' on the future" you are referring to the initial inspect call's future, correct? In that case, I'm not convinced that is the right solution either. If you dig into the inspect call, you'll see that it has a fairly sophisticated retry system of its own. However, that retry system is designed around whether the actual inspect command completes its execution: if the inspect command runs forever, then that thread blocks waiting for it forever.

I am leaning towards a combination of both. I believe it is appropriate to wait a small amount of time before calling the initial inspect, to give docker a chance to create and start the container. I see your point that there is still the possibility of a race condition if the run command is itself delayed by just the right amount of time; I understand that the run thread could be executed any time after the future is created.

When running vanilla Mesos (without my current patch), remember that the containers that launch successfully actually do so because the initial inspect fails (returns non-zero, as the container doesn't exist yet), and the inspect function's built-in retry system delays 500ms and runs again, this time successfully. (See these logs to see what I mean: https://gist.github.com/travishegner/dcb5797f5a0919e45618) With that in mind, we wouldn't actually know whether the inspect is just delayed in its own retry pattern or is actually hung unless we wait a large enough amount of time. Given how frequently we encounter this issue in our environment (more than half of attempted container launches), waiting a long time before indicating whether the container is running would be undesirable.
How do you feel about an initial delay (500ms, perhaps?), in concert with an ".after()" (3000ms?) on the inspect future that would discard the original inspect and try again? The initial delay would give docker time to create and start our container, and in my environment would drastically reduce the frequency with which we hit this condition. Ironically, the name of the retry-delay variable (DOCKER_INSPECT_DELAY) seems to indicate that the author expected a delay to be inserted before the inspect call anyway, but reading through the code, that variable is only used as the retry interval, and the initial docker inspect is executed immediately. A timeout/retry on the inspect call helps guard against this type of issue, should the run and inspect commands be launched simultaneously for whatever reason.

A final thought: are we looking at this from the wrong perspective? Should the inspect call itself be modified to handle the case where the inspect command never returns?

Thanks for your thoughts,

Travis

-----Original Message-----
From: Jojy Varghese [mailto:j...@mesosphere.io]
Sent: Tuesday, February 2, 2016 10:34 PM
To: dev@mesos.apache.org
Subject: Re: MESOS-4581 Docker containers getting stuck in staging.

Hi Travis

Thanks for narrowing down the issue. I had a brief look at your patch, and it looks like it relies on adding a delay before inspect is called. Although that might work mostly, I am wondering if that is the right solution. It would be better if we can have a timeout (using 'after' on the future) and retry inspect after the timeout. We will have to discard the inspect future that's in flight.
-Jojy

> On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
>
> I'd like to initiate a discussion on the following issue:
>
> https://issues.apache.org/jira/browse/MESOS-4581
>
> I've included a lot of detail in the JIRA, and would rather not reiterate
> _all_ of it here on the list, but in short:
>
> We are experiencing an issue when launching docker containers from marathon
> on mesos, where the container actually starts on the slave node to which it's
> assigned, but mesos/marathon get stuck in staging/staged respectively until
> the task launch times out and the system tries again to launch it elsewhere.
> This issue is random in nature, successfully starting tasks about 40-50% of
> the time, while the rest of the time getting stuck.
>
> We've been able to narrow this down to a possible race condition, likely in
> docker itself, but triggered by the mesos-docker-executor. I have written and
> tested a patch in our environment which seems to have eliminated the issue,
> however I feel that the patch could be made more robust, and is currently
> just a work-around.
>
> Thanks for your time and consideration of the issue.
>
> Travis