Hi Hegner. I think Jojy means we change Docker::inspect from

```
s.get().status()
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

to

```
s.get().status()
  .after(timeoutSeconds, []() {...})
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```
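The `{...}` body above is left open on purpose; a minimal sketch of what it might contain, assuming libprocess's `Future::after` overload that passes in the timed-out future (a real fix would likely also need to kill the still-running `docker inspect` subprocess, which this sketch omits):

```
s.get().status()
  .after(timeoutSeconds,
         [=](Future<Option<int>> status) -> Future<Option<int>> {
    // The status future did not complete in time: discard the in-flight
    // attempt and fail the future so the `.onAny` continuation below
    // fires and can schedule a retry.
    status.discard();
    return Failure("docker inspect timed out");
  })
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```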
If 500ms doesn't work in a different environment, we would still run into the same problem. I think adding a timeout limit in inspect would be better.

On Wed, Feb 3, 2016 at 10:05 PM, Hegner, Travis <theg...@trilliumit.com> wrote:

> Thanks Jojy,
>
> I agree that my current patch is not the most ideal solution, as it was intended as a proof of concept to ensure that it was in fact the timing that was causing the problems I've been experiencing. Perhaps I didn't make that clear enough in the JIRA. That is also the reason I haven't attempted to submit that patch back.
>
> I was hoping to open a dialog to discuss the most appropriate fix for that error, as I am relatively new to the future/promise paradigm, and I haven't done much in C++ in 10 years.
>
> I'm assuming that when you say "using 'after' on the future" you are referring to the initial inspect call's future, correct? In that case, I'm not convinced that is the right solution either. If you dig into the inspect call, you'll see that it has a fairly sophisticated retry system itself. However, that retry system is designed around whether the actual inspect command completes its execution. If the inspect command runs forever, then that thread blocks waiting for it forever.
>
> I am leaning towards a combination of both. I believe that it is appropriate to wait a small amount of time before the initial inspect call, to give docker a chance to create and start the container. I see your point that there is still the possibility of a race condition if the run command is delayed for the right amount of time itself; I understand that the run thread could be executed any time after the future is created. When running vanilla mesos (without my current patch), remember that the containers that launch successfully actually do so because the initial inspect fails (returns non-zero, as the container doesn't exist yet), and the inspect function's built-in retry system delays 500ms and runs again, this time successfully. (See these logs to see what I mean: https://gist.github.com/travishegner/dcb5797f5a0919e45618)
>
> With that in mind, we wouldn't actually know whether the inspect is just delayed in its own retry pattern, or is actually hung, unless we wait a large enough amount of time. Given how frequently we encounter this issue in our environment (more than half of the attempted container launches), waiting a large amount of time to determine whether the container is running would be undesirable.
>
> How do you feel about an initial delay (how long? 500ms?), in concert with an ".after()" (how long? 3000ms?) on the inspect future that will discard the original inspect and try again? The initial delay would give docker time to create and start our container, and in my environment would drastically reduce the frequency with which we are met with this condition. Ironically, the variable name of the retry delay (DOCKER_INSPECT_DELAY) seems to indicate that the author expected a delay to be inserted before the inspect call anyway, but in reading through the code, that variable is only used as the retry interval, and the initial docker inspect is executed immediately. A timeout/retry on the inspect call helps to guard against this type of issue, should the run and inspect commands be launched simultaneously for whatever reason.
>
> A final thought: are we looking at this from the wrong perspective?
> Should the inspect call itself be modified to handle the case when the inspect command never returns?
>
> Thanks for your thoughts,
>
> Travis
>
> -----Original Message-----
> From: Jojy Varghese [mailto:j...@mesosphere.io]
> Sent: Tuesday, February 2, 2016 10:34 PM
> To: dev@mesos.apache.org
> Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
>
> Hi Travis,
> Thanks for narrowing down the issue. I had a brief look at your patch, and it looks like it relies on adding a delay before inspect is called. Although that might mostly work, I am wondering if that is the right solution. It would be better if we can have a timeout (using ‘after’ on the future) and retry inspect after the timeout. We will have to discard the inspect future that's in flight.
>
> -Jojy
>
>
> > On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
> >
> > I'd like to initiate a discussion on the following issue:
> >
> > https://issues.apache.org/jira/browse/MESOS-4581
> >
> > I've included a lot of detail in the JIRA, and would rather not reiterate _all_ of it here on the list, but in short:
> >
> > We are experiencing an issue when launching docker containers from marathon on mesos, where the container actually starts on the slave node to which it's assigned, but mesos/marathon get stuck in staging/staged respectively, until the task launch times out and the system tries again to launch it elsewhere. This issue is random in nature, successfully starting tasks about 40-50% of the time, while getting stuck the rest of the time.
> >
> > We've been able to narrow this down to a possible race condition, likely in docker itself, but triggered by the mesos-docker-executor. I have written and tested a patch in our environment which seems to have eliminated the issue; however, I feel that the patch could be made more robust, and it is currently just a work-around.
> >
> > Thanks for your time and consideration of the issue.
> >
> > Travis

--
Best Regards,
Haosdent Huang
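For concreteness, the combination Travis proposes above (a short initial delay plus a per-attempt timeout that discards and retries) might look roughly like the following. This is only a sketch, not the actual Mesos patch: the constants, the `retryInspect` helper, and the surrounding variables (`docker`, `containerName`) are illustrative assumptions.

```
#include <process/after.hpp>   // process::after
#include <process/future.hpp>  // Future, discard

// Hypothetical constants: a grace period before the first inspect, and a
// bound on how long any single inspect attempt may run before we give up.
const Duration DOCKER_INSPECT_DELAY = Milliseconds(500);
const Duration DOCKER_INSPECT_TIMEOUT = Seconds(3);

// Wait briefly so docker has a chance to create and start the container,
// then bound the inspect attempt so a hung `docker inspect` is discarded
// and retried, instead of blocking its thread forever.
Future<Docker::Container> container =
  process::after(DOCKER_INSPECT_DELAY)
    .then([=](const Nothing&) {
      return docker->inspect(containerName, DOCKER_INSPECT_DELAY)
        .after(DOCKER_INSPECT_TIMEOUT,
               [=](Future<Docker::Container> attempt)
                   -> Future<Docker::Container> {
          attempt.discard();     // drop the in-flight, possibly hung attempt
          return retryInspect(); // hypothetical helper: start a fresh attempt
        });
    });
```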