Thanks haosdent, that makes sense to allow the inspect command to time out on its own. Should Docker::inspect return to the caller on timeout, or follow a retry system like it does with a non-zero exit code? If the latter, should it retry forever if it times out every time?
Are you suggesting this in addition to the patch I have introduced so far? I still believe the initial delay is important, as without it we would encounter the docker inspect timeout very frequently in our environment.

Thanks,

Travis

-----Original Message-----
From: haosdent [mailto:haosd...@gmail.com]
Sent: Wednesday, February 3, 2016 10:05 AM
To: dev <dev@mesos.apache.org>
Subject: Re: MESOS-4581 Docker containers getting stuck in staging.

Hi, Hegner. I think Jojy means we change Docker::inspect from

```
s.get().status()
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

to

```
s.get().status()
  .after(timeoutSeconds, []() {...})
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

If 500ms doesn't work in a different environment, we would still run into the same problem. I think adding a timeout limit to inspect would be better.

On Wed, Feb 3, 2016 at 10:05 PM, Hegner, Travis <theg...@trilliumit.com> wrote:

> Thanks Jojy,
>
> I agree that my current patch is not the most ideal solution, as it was
> intended as a proof of concept to ensure that it was in fact the timing
> that was causing the problems I've been experiencing. Perhaps I didn't
> make that clear enough in the JIRA. That is also the reason I haven't
> attempted to submit that patch back.
>
> I was hoping to open a dialog to discuss the most appropriate fix for
> that error, as I am relatively new to the future/promise paradigm, and I
> haven't done much in C++ in 10 years.
>
> I'm assuming that when you say "using 'after' on the future" you are
> referring to the initial inspect call's future, correct? In that case,
> I'm not convinced that is the right solution either. If you dig into the
> inspect call, you'll see that it has a fairly sophisticated retry system
> itself. However, that retry system is designed around whether the actual
> inspect command completes its execution. If the inspect command runs
> forever, then that thread blocks waiting for it forever.
>
> I am leaning towards a combination of both. I believe that it is
> appropriate to wait a small amount of time before calling the initial
> inspect, to give docker a chance to create and start the container. I
> see your point that there is still the possibility of a race condition
> if the run command itself is delayed by the right amount of time; I
> understand that the run thread could be executed any time after the
> future is created. When running vanilla Mesos (without my current
> patch), remember that the containers that launch successfully actually
> do so because the initial inspect fails (returns non-zero, as the
> container doesn't exist yet), and the inspect function's built-in retry
> system delays 500ms and runs again, this time successfully. (See these
> logs to see what I mean:
> https://gist.github.com/travishegner/dcb5797f5a0919e45618)
>
> With that in mind, we wouldn't actually know whether the inspect is just
> delayed in its own retry pattern, or is actually hung, unless we wait a
> large enough amount of time. Given how frequently we encounter this
> issue in our environment (more than half of the attempted container
> launches), waiting a long time before indicating whether the container
> is running would be undesirable.
>
> How do you feel about an initial delay (how long: 500ms?), in concert
> with an ".after()" (how long: 3000ms?) on the inspect future that will
> discard the original inspect and try again? The initial delay would give
> docker time to create and start our container, and in my environment it
> drastically reduces the frequency with which we are met with this
> condition. Ironically, the variable name (DOCKER_INSPECT_DELAY) of the
> retry delay seems to indicate that the author expected a delay to be
> inserted before the inspect call anyway, but in reading through the
> code, that variable is only used as the retry interval, and the initial
> docker inspect is executed immediately. A timeout/retry on the inspect
> call helps to guard against this type of issue, should the run and
> inspect commands be launched simultaneously for whatever reason.
>
> A final thought: are we looking at this from the wrong perspective?
> Should the inspect call itself be modified to handle the case where the
> inspect command never returns?
>
> Thanks for your thoughts,
>
> Travis
>
> -----Original Message-----
> From: Jojy Varghese [mailto:j...@mesosphere.io]
> Sent: Tuesday, February 2, 2016 10:34 PM
> To: dev@mesos.apache.org
> Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
>
> Hi Travis,
> Thanks for narrowing down the issue. I had a brief look at your patch,
> and it looks like it relies on adding a delay before inspect is called.
> Although that might mostly work, I am wondering if that is the right
> solution. It would be better if we could have a timeout (using 'after'
> on the future) and retry inspect after the timeout. We will have to
> discard the inspect future that's in flight.
>
> -Jojy
>
> > On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
> >
> > I'd like to initiate a discussion on the following issue:
> >
> > https://issues.apache.org/jira/browse/MESOS-4581
> >
> > I've included a lot of detail in the JIRA, and would rather not
> > reiterate _all_ of it here on the list, but in short:
> >
> > We are experiencing an issue when launching docker containers from
> > Marathon on Mesos, where the container actually starts on the slave
> > node to which it's assigned, but Mesos/Marathon get stuck in
> > staging/staged respectively until the task launch times out and the
> > system tries again to launch it elsewhere. This issue is random in
> > nature, successfully starting tasks about 40-50% of the time, while
> > the rest of the time getting stuck.
> >
> > We've been able to narrow this down to a possible race condition,
> > likely in docker itself, but triggered by the mesos-docker-executor.
> > I have written and tested a patch in our environment which seems to
> > have eliminated the issue; however, I feel that the patch could be
> > made more robust, and it is currently just a work-around.
> >
> > Thanks for your time and consideration of the issue.
> >
> > Travis

--
Best Regards,
Haosdent Huang