Thanks haosdent, that makes sense to allow the inspect command to time out on its own. Should Docker::inspect return to the caller on timeout, or follow a retry system like it does with a non-zero exit code? If the latter, should it retry forever if it times out every time?
Are you suggesting this in addition to the patch I have introduced so far? I still believe the initial delay is important, as without it we would encounter the docker inspect timeout very frequently in our environment.

Thanks,

Travis

-----Original Message-----
From: haosdent [mailto:haosd...@gmail.com]
Sent: Wednesday, February 3, 2016 10:05 AM
To: dev <dev@mesos.apache.org>
Subject: Re: MESOS-4581 Docker containers getting stuck in staging.

Hi, Hegner. I think Jojy means we change Docker::inspect from

```
s.get().status()
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

to

```
s.get().status()
  .after(timeoutSeconds, []() {...})
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

If 500ms doesn't work in a different environment, we would still run into the same problem. I think adding a timeout limit to inspect would be better.

On Wed, Feb 3, 2016 at 10:05 PM, Hegner, Travis <theg...@trilliumit.com> wrote:

> Thanks Jojy,
>
> I agree that my current patch is not the most ideal solution, as it was
> intended as a proof of concept to ensure that it was in fact the timing
> that was causing the problems I've been experiencing. Perhaps I didn't
> make that clear enough in the JIRA. That is also the reason I haven't
> attempted to submit that patch back.
>
> I was hoping to open a dialog to discuss the most appropriate fix for
> that error, as I am relatively new to the future/promise paradigm, and I
> haven't done much in C++ in 10 years.
>
> I'm assuming that when you say "using 'after' on the future" you are
> referring to the initial inspect call's future, correct? In that case,
> I'm not convinced that is the right solution either. If you dig into the
> inspect call, you'll see that it has a fairly sophisticated retry system
> itself. However, that retry system is designed around whether the actual
> inspect command completes its execution. If the inspect command runs
> forever, then that thread blocks waiting for it forever.
>
> I am leaning towards a combination of both. I believe that it is
> appropriate to wait a small amount of time before calling the initial
> inspect, to give docker a chance to create and start the container. I
> see your point that there is still the possibility of a race condition
> if the run command itself is delayed by the right amount of time; I
> understand that the run thread could be executed any time after the
> future is created. When running vanilla Mesos (without my current
> patch), remember that the containers that launch successfully actually
> do so because the initial inspect fails (returns non-zero, as the
> container doesn't exist yet), and the inspect function's built-in retry
> system delays 500ms and runs again, this time successfully. (See these
> logs to see what I mean:
> https://gist.github.com/travishegner/dcb5797f5a0919e45618)
>
> With that in mind, we wouldn't actually know whether the inspect is just
> delayed in its own retry pattern, or is actually hung, unless we wait a
> large enough amount of time. Given how frequently we encounter this
> issue in our environment (more than half of the attempted container
> launches), waiting a long time before indicating whether the container
> is running would be undesirable.
>
> How do you feel about an initial delay (how long: 500ms?), in concert
> with an ".after()" (how long: 3000ms?) on the inspect future that will
> discard the original inspect and try again? The initial delay would give
> docker time to create and start our container, and in my environment it
> drastically reduces the frequency with which we are met with this
> condition. Ironically, the variable name (DOCKER_INSPECT_DELAY) of the
> retry delay seems to indicate that the author expected a delay to be
> inserted before the inspect call anyway, but in reading through the
> code, that variable is only used as the retry interval, and the initial
> docker inspect is executed immediately. A timeout/retry on the inspect
> call helps to guard against this type of issue, should the run and
> inspect commands be launched simultaneously for whatever reason.
>
> A final thought: are we looking at this from the wrong perspective?
> Should the inspect call itself be modified to handle the case where the
> inspect command never returns?
>
> Thanks for your thoughts,
>
> Travis
>
> -----Original Message-----
> From: Jojy Varghese [mailto:j...@mesosphere.io]
> Sent: Tuesday, February 2, 2016 10:34 PM
> To: dev@mesos.apache.org
> Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
>
> Hi Travis,
> Thanks for narrowing down the issue. I had a brief look at your patch,
> and it looks like it relies on adding a delay before inspect is called.
> Although that might mostly work, I am wondering if that is the right
> solution. It would be better if we could have a timeout (using 'after'
> on the future) and retry inspect after the timeout. We will have to
> discard the inspect future that's in flight.
>
> -Jojy
>
> > On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
> >
> > I'd like to initiate a discussion on the following issue:
> >
> > https://issues.apache.org/jira/browse/MESOS-4581
> >
> > I've included a lot of detail in the JIRA, and would rather not
> > reiterate _all_ of it here on the list, but in short:
> >
> > We are experiencing an issue when launching docker containers from
> > Marathon on Mesos, where the container actually starts on the slave
> > node to which it's assigned, but Mesos/Marathon get stuck in
> > staging/staged respectively until the task launch times out and the
> > system tries again to launch it elsewhere. This issue is random in
> > nature, successfully starting tasks about 40-50% of the time, while
> > the rest of the time getting stuck.
> >
> > We've been able to narrow this down to a possible race condition,
> > likely in docker itself, but triggered by the mesos-docker-executor.
> > I have written and tested a patch in our environment which seems to
> > have eliminated the issue; however, I feel that the patch could be
> > made more robust, and it is currently just a work-around.
> >
> > Thanks for your time and consideration of the issue.
> >
> > Travis

--
Best Regards,
Haosdent Huang